<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" xmlns:dc="http://purl.org/dc/elements/1.1/">
<channel>
<title>MCMedia RSS Feed - Systems</title>
<link>https://news.mcmedia.cam/feed/Systems</link>
<description>MCMedia News RSS Feed - Systems specific feed</description>
<docs>https://news.mcmedia.cam/rss-info.html</docs>
<generator>MCMedia RSS Generator v1.0</generator>
<lastBuildDate>Thu, 11 Jun 2026 19:30:33 +0000</lastBuildDate>
<atom:link href="https://news.mcmedia.cam/rss/rss_systems.xml" rel="self" type="application/rss+xml"/>
<item>
  <title>The Dynamics of Human and AI-Generated Language: How Semantics Fluctuates across Different Timescales</title>
  <link>https://arxiv.org/abs/2606.11371</link>
  <pubDate>Thu, 11 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.11371v1 Announce Type: cross Abstract: Spoken language, whether produced by humans or large language models (LLMs), unfolds over time with varying semantic content. However, we still lack simple, interpretable time-series features that capture how generic versus specific content is distributed over time, and that can be used to compare human and AI-generated speech. We introduce a semantic-timescale analysis pipeline that turns word-level transcripts with timestamps into semantic time-series. For each spoken narrative, we compute (i) semantic specificity using WordNet-based word depth and (ii) contextual similarity using SBERT embeddings and quantify their temporal dependence using autocorrelation-window measures (ACW-0 and related metrics). We then compare original speech to multiple shuffled controls that selectively disrupt lexical identity, temporal order, and word duration. Across human-read autobiographical narratives, TTS readings, and LLM-generated texts rendered with TTS, we find that segments with longer ACW-0 in the semantic time-series tend to contain more generic vocabulary, whereas segments with shorter ACW-0 are enriched in more specific words. These associations are strongly attenuated or abolished when word order and timing are randomized, indicating that ACW-based measures capture non-trivial temporal organization of semantic content beyond static lexical distributions. Our results suggest that ACW-based semantic timescales are a useful family of features for analyzing and comparing the temporal structure of human and AI-generated speech.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Designed-Source Reductions and a Dual-Purpose Feasibility Band for Semantic Rate-Distortion</title>
  <link>https://arxiv.org/abs/2606.11280</link>
  <pubDate>Thu, 11 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.11280v1 Announce Type: cross Abstract: The joint rate-distortion framework of Stavrou and Kountouris (IEEE Transactions on Communications 2023) characterises dual-fidelity tradeoffs for semantic communication on stochastic semantic sources. Many task-oriented communication systems instead use designed sources, where the semantic object is a deterministic oracle allocation $\phi^*(t)$ rather than a stochastic quantity given by nature. We isolate the subclass of designed sources under smooth concave utility with assumptions A1, A2 and Euclidean allocation codomain, and restrict the encoder class to deterministic common-category mappings. Within this subclass the SK exponential-tilting decoder and generalised Blahut--Arimoto iteration specialise to conditional-mean decoding and Lloyd--Max stationarity on $\phi^*(t)$. When the second fidelity is a monotone single-letter distortion, the joint problem stays inside the SK admissible class; the common-category SK rate is lower-bounded by the max of the corresponding Shannon rate-distortion functions, with equality only when the common-category reconstruction is compatible and RDF-optimal. When the second fidelity is aggregate verification, the joint problem leaves the SK single-letter class and admits a constrained-design feasibility band $R_{\min}(\varepsilon^*) \leq R \leq R_{\max}(\beta^*)$ of width $\log_2(K_{\max}/K_{\min})$ bits in partition cardinality. The reduction and the band are scope statements on the SK apparatus, not modifications to it. A smart-grid economic-dispatch example with a non-technical-loss-detection contrast illustrates the band.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Bending the Rules of Propagation: Caustic Beamforming for Next-Generation Wireless Systems</title>
  <link>https://arxiv.org/abs/2606.12321</link>
  <pubDate>Thu, 11 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.12321v1 Announce Type: new Abstract: Conventional beamforming techniques primarily steer energy along desired directions or focus it at specific locations. These techniques become fragile when facing frequent blockage and highly dynamic propagation environments. In this article, we present caustic beamforming as a new paradigm for wireless beam control. First, we classify representative caustic beams according to their underlying mathematical origins and present three unique properties, namely self-bending, self-healing, and near-field non-diffracting. Building on these propagation properties, we then propose several application scenarios in sixth-generation (6G) networks. We undertake two case studies focused on physical layer security and service stability that highlight the capability of caustic beams to bypass potential eavesdroppers, deliver more uniform coverage, and sustain blockage-resilient links. We further discuss the enabling hardware architectures that facilitate practical deployments, and finally outline key open challenges regarding caustic beams that require further research.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Near Field Multi-Band Localization: CRB, Efficient Estimator, and Threshold SNR</title>
  <link>https://arxiv.org/abs/2606.12314</link>
  <pubDate>Thu, 11 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.12314v1 Announce Type: new Abstract: This paper presents a theoretical framework for multi-band localization for a single-path single-input multiple-output (SIMO) system. We derive closed-form Cramér-Rao bounds (CRBs) for angle-of-arrival (AoA) and distance for uniform linear arrays (ULAs), and an intermediate matrix-form formulation for arbitrary array shapes. We also develop benchmark single- and multi-band maximum-likelihood (ML) estimators for AoA-Distance, leveraging a structured Levenberg-Marquardt (LM) refinement procedure. A key contribution is an analytical characterization of the threshold SNR (TSNR) for the proposed estimators. This is the SNR threshold at which the estimator transitions from &quot;off the chart&quot; to CRB-approaching performance, for both AoA and distance estimation. Numerical simulations confirm that the proposed single- and multi-band estimators achieve the CRB at SNRs above the predicted TSNR, and that multi-band processing simultaneously improves estimation accuracy and reduces SNR requirements. The resulting framework provides a rigorous foundation for next-generation multi-band localization and can be readily extended to elevation estimation, distributed arrays, and multi-path environments.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>LLM-Based Digital Twin Intelligence for Application-Aware Network Selection in 6G Heterogeneous Wireless Networks</title>
  <link>https://arxiv.org/abs/2606.12293</link>
  <pubDate>Thu, 11 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.12293v1 Announce Type: new Abstract: Future 6G heterogeneous wireless networks (HWNs) are expected to support multiple radio access technologies (RATs), dynamic wireless environments, and applications with diverse quality-of-service (QoS) requirements. In such environments, network selection (NS) cannot rely only on instantaneous radio measurements or static ranking rules. Instead, access decisions must account for the evolving wireless state, service intent, packet-level QoS behavior, and candidate-RAT dynamics. This paper proposes a large language model (LLM)-based digital twin (DT) framework for stable, application-aware RAT selection under candidate-set evolution. The main idea is to shift NS from an instantaneous decision-matrix operation to a decision process over an evolving wireless DT state. The constructed DT combines site-specific geometry, Sionna RT-based propagation descriptors, ns-3 packet-level QoS emulation, service context, candidate-RAT information, and decision memory. Rather than acting as a general-purpose controller for 6G networks, the LLM is used for DT-grounded decision intelligence in this specific NS task. On top of this DT, a unified intent agent translates user and service requirements into structured decision priorities for two complementary NS branches: an LLM-assisted multi-attribute decision-making branch (MADM--LLM--NS) and a direct LLM-based ranking branch (LLM--NS). To improve decision stability, the framework further introduces history-aware adaptive normalization (HAAN) and DT-memory-driven retrieval-augmented in-context learning (RA--ICL). Numerical results show that the proposed framework reduces rank reversals and unnecessary handover events, while improving service-aware QoS satisfaction compared with representative MADM-based NS baselines.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Characterization of Speech Imagery in Scalp EEG and Comparison with Motor Imagery</title>
  <link>https://arxiv.org/abs/2606.12223</link>
  <pubDate>Thu, 11 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.12223v1 Announce Type: new Abstract: Speech imagery is attractive as a brain-computer interface paradigm for communication because it is endogenous and intrinsically linguistic. Yet despite growing interest, its dominant scalp-EEG spatiotemporal characteristics remain poorly characterized. Here, we asked how speech imagery appears in scalp EEG and compared it against finger motor imagery. Using a within-subject dataset containing speech imagery, finger motor imagery, and no-task trials recorded under the same trial structure, we analyzed band-power dynamics across channels and time. Finger motor imagery showed the expected contralateral mu/alpha and low-beta desynchronization over sensorimotor areas, whereas speech imagery showed a weaker, more distributed alpha-dominant increase. After normalization to each condition&#39;s own post-trial interval, the speech-related alpha increase changed only modestly after cue onset, indicating that much of the speech-versus-no-task difference was already present during the instruction period. A classifier discriminating imagery from no-task reached mean balanced accuracies of 0.563 $\pm$ 0.072 for speech imagery and 0.718 $\pm$ 0.127 for motor imagery, with a stronger alpha/beta dependence for motor imagery than for speech imagery. Together, these results provide a clearer group-level characterization of speech imagery in scalp EEG and indicate that its dominant spatiotemporal pattern differs from that of finger motor imagery and is more consistent with substantial non-articulatory task-related contributions than with a clear articulatory-motor analogue.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Deep Reinforcement Learning for Adaptive Power Allocation in ISAC Systems with Mobile Target</title>
  <link>https://arxiv.org/abs/2606.12078</link>
  <pubDate>Thu, 11 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.12078v1 Announce Type: new Abstract: In this paper, we study power allocation for an integrated sensing and communication (ISAC) system that tracks a mobile target. We first model the problem as a Markov decision process, and then tackle it with a soft actor-critic (SAC) based deep reinforcement learning (DRL) approach. We also incorporate a Dirichlet policy, which naturally produces normalized continuous actions under random target motion. To exploit different features of sensing and communication operations, we carefully design a reward function such that the system can dynamically control power allocation to conserve resources. The simulation results demonstrate that the proposed scheme enhances tracking performance compared to baselines while sustaining communication performance.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Unlocking the Potential of Movable Antennas: General and Practical Antenna Position Optimization</title>
  <link>https://arxiv.org/abs/2606.12024</link>
  <pubDate>Thu, 11 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.12024v1 Announce Type: new Abstract: Recently, movable antenna (MA) has attracted wide attention in wireless communications due to its potential in enhancing wireless communication performance via local movement within a confined region. However, antenna position optimization (APO) has emerged as a major challenge for MAs, due to the lack of a tractable, analytical, and accurate channel model in terms of antenna positions. Although existing works have developed various algorithms for APO, most of them are based on simplified theoretical channel models, which limit their generality. To address this challenge, in this article, we present more general and effective APO algorithms for different purposes, categorized as continuous APO and discrete APO, respectively. Continuous APO is mainly applied for flexible array signal processing to boost large-scale communication performance, while discrete APO is applied for small-scale multi-path channel reshaping. Specifically, the discrete APO discretizes the antenna movement region into multiple sampling points and employs discrete algorithms to determine the optimal MA positions based on the point-wise channel state information (CSI), without the need for an analytical channel model. To reduce the overhead for CSI acquisition, we also present more efficient learning-based APO algorithms that operate without requiring full point-wise CSI. Finally, we compare the application scenarios of the proposed algorithms and validate their effectiveness with numerical results.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Low-Density EEG for Seizure Detection: Evaluating CNN-RNN Architectures on a Behind-the-Ear Montage Setup</title>
  <link>https://arxiv.org/abs/2606.11970</link>
  <pubDate>Thu, 11 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.11970v1 Announce Type: new Abstract: Epilepsy affects over 50 million individuals globally, underscoring the need for automated seizure detection systems that can alleviate clinicians' workload and enhance the accuracy of patient seizure diaries. In wearable EEG applications, however, reliable detection remains challenging due to the limited spatial resolution of low-density electrode configurations, reduced signal-to-noise ratios, and the scarcity of diverse, publicly available training datasets. This study investigates the efficacy of hybrid deep learning architectures for automated seizure detection using a simulated behind-the-ear montage derived from the Temple University Seizure Corpus (TUSZ, v2.0.3). We conduct a systematic comparison of several CNN-RNN models, including LSTM- and GRU-based variants, across multiple EEG montages to evaluate their capacity to compensate for the loss of spatial information inherent to reduced electrode configurations. The proposed CNN-Merged model, which integrates temporal and spectral feature representations, demonstrates superior performance, achieving a ROC AUC of 85.89% and a balanced accuracy of 79.11% on the held-out test set. Furthermore, the model exhibits strong robustness across different reference montages, effectively bridging the performance gap between conventional full-scalp recordings and resource-constrained wearable systems. These findings substantiate the potential of hybrid deep learning models as a promising avenue toward robust, patient-independent seizure detection in low-density EEG applications.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>NARRAS: Edge-Triggered Distributed Inference for CSI-Based Localization in Vehicular IoT Networks</title>
  <link>https://arxiv.org/abs/2606.11914</link>
  <pubDate>Thu, 11 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.11914v1 Announce Type: new Abstract: CSI-based localization with spatially distributed antenna arrays exposes a basic resource trade-off. Each array can provide a rich view of the channel, but forwarding observations from all arrays to a fusion center is wasteful when only a few carry useful information, and the shared uplink supports only a limited number of simultaneous transmissions. We let each array decide locally whether its current observation is worth reporting, subject to a budget on the average number of active transmitters. We refer to this abstraction as Edge-Triggered Distributed Inference (ETDI). It captures a broader class of task-oriented communication problems where resource-constrained devices share an access channel for a common inference task. We instantiate ETDI for CSI-based localization, a common scenario in vehicular IoT networks. Spatially distributed remote antenna arrays (RAAs) encode local channel state information (CSI) from user equipment (UE) transmissions into latent features, and the fusion center estimates the UE position from the subset of reported features. We propose NARRAS, a decentralized reporting policy in which each RAA combines a recurrent summary of its recent observations with a memory of the last latent it transmitted. Training controls an explicit activity budget through differentiable activity penalties and validation-calibrated deterministic thresholds, and uses channel-chart regularization to shape the latent geometry. Experiments show that, at comparable uplink activity, NARRAS improves localization accuracy over learned and heuristic sparse-reporting strategies, while dense full-report models remain useful budget-free references. In low-activity regimes, chart regularization further reduces high-percentile localization errors, suggesting that geometry-aware latent representations are more robust under sparse reporting.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Efficiency Meets Reliability: Enhanced Generalized Interleaved Transform for Random Multiplexing</title>
  <link>https://arxiv.org/abs/2606.11890</link>
  <pubDate>Thu, 11 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.11890v1 Announce Type: new Abstract: To meet the demands of 6G wireless systems operating in high-mobility scenarios, this paper presents a design of a random multiplexing (RM) communication system that is both storage-efficient and highly reliable. In principle, RM with cross-domain memory approximate message passing (CD-MAMP) can achieve replica maximum a posteriori (MAP)-optimal performance by constructing a fully dense equivalent channel matrix. However, its practical implementation is hindered by the large storage overhead of conventional interleavers and by performance degradation in severely ill-conditioned channels, which existing related work (focusing on interleaving and transform designs) fails to address simultaneously. To overcome these issues, we develop a storage-efficient and highly reliable system that integrates RM with CD-MAMP, referred to as RM-MAMP. Specifically, we propose a Logistic chaotic mapping interleaver with a quantitative parameter-selection criterion, and a dual-stage high-order permutation polynomial interleaver, both of which achieve nearly the same bit-error rate (BER) as fully random interleavers while reducing the interleaver storage from O(N) to O(1) and significantly lowering interleaver signaling overhead. We further propose a highly reliable interleaved transform framework, comprising an interleaved phase perturbation transform and a multi-layer interleaved coupled transform, to enhance the incoherence and diversity of the equivalent channel matrix. Simulation results show that the proposed storage-efficient interleavers maintain BER performance comparable to fully random interleavers, while the highly reliable transforms provide over 4 dB gain in severely time-varying channels, confirming the dual benefits of reduced storage overhead and improved robustness for the enhanced RM-MAMP system.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>On the Robustness of AFBM Sensing to Power Amplifier Nonlinearities</title>
  <link>https://arxiv.org/abs/2606.11879</link>
  <pubDate>Thu, 11 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.11879v1 Announce Type: new Abstract: We investigate the impact of power amplifier (PA) nonlinearities on the sensing performance of affine filter bank modulation (AFBM). While AFBM offers several advantageous properties for integrated sensing and communications (ISAC) - including reduced out-of-band emission (OOBE), low peak-to-average power ratio (PAPR), and natural robustness to doubly-dispersive (DD) channel effects - mitigating waveform distortion typically requires highly linear PAs. This creates a fundamental contradiction with ISAC applications, which demand high transmit power for reliable sensing. Our analytical results reveal that the structure of the effective AFBM modulation matrix dictates how distortion propagates within the ambiguity function (AF). Furthermore, simulations demonstrate that both the AF and the overall sensing performance of AFBM remain remarkably insensitive to such nonlinearities. These findings highlight the robustness of AFBM, making it a highly viable candidate for practical ISAC deployments constrained by hardware impairments.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>REACH: Interpretability-Driven Feature Identification and Architecture Compression for Multi-Channel Vehicular Channel Estimation</title>
  <link>https://arxiv.org/abs/2606.11857</link>
  <pubDate>Thu, 11 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.11857v1 Announce Type: new Abstract: Multi-channel mixed-SNR training improves out-of-distribution (OOD) generalisation of deep learning channel estimators for IEEE 802.11p vehicular communications, yet the internal mechanism responsible for this remains unexplained. This work presents REACH (Relevance-based Explanation and Architectural Compression for cHannel estimators), a gradient-based interpretability framework that operates at two levels. Input-level attribution identifies a subset of time-frequency features consistently relevant across all evaluated channel conditions, enabling input dimensionality reduction with minimal performance loss. Filter-level attribution reveals a near-universal internal representation, providing a representational account of the observed OOD generalisation. Guided by the resulting filter taxonomy, relevance-guided architecture compression substantially reduces both the number of parameters and the number of floating-point operations (FLOPs) with sub-1 dB normalised mean square error (NMSE) degradation, and OOD generalisation degrades more slowly than within-distribution accuracy under increasing compression.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Parametric Channel Estimation with Hardware Impaired Hybrid Beamformers: Sensing, Communications, and Power Efficiency Tradeoffs</title>
  <link>https://arxiv.org/abs/2606.11829</link>
  <pubDate>Thu, 11 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.11829v1 Announce Type: new Abstract: Due to high power consumption and hardware costs of fully digital arrays, hybrid beamformers are often considered as a more economical alternative. Furthermore, using high resolution analog-to-digital converters (ADCs) can also have prohibitive power consumption, which leads to lower resolution converters being considered for radio frequency (RF) front end design. The finite quantization resolution as well as the nonlinearities caused by the power amplifiers (PAs) and low noise amplifiers (LNAs) can have a substantial impact on system performance. While widely studied for communications, the impact of hardware impairments on sensing performance is considerably less explored. In this work, we study the interplay between hybrid beamforming architectures, hardware impairments, and sensing and communications performance. Additionally, we define the concept of double-isotropy for pilot-combiner pairs, formalizing the notion of a perfectly energy-fair beam sweep. The multiple start (MS) space alternating generalized expectation maximization algorithm (SAGE) is also introduced, aimed at addressing the optimization issues arising from parametric channel estimation (PCE) in hybrid beamformed systems. We then provide a set of numerical results assessing the impacts of beamformer architecture and ADC resolution on PCE, sensing, and communications performance. The results show that medium resolution ADCs lead to the most power efficient configurations, with the best tradeoff between power consumption and performance for the majority of beamforming architectures. Additionally, fully digital beamforming architectures with high resolution converters can often be replaced by a hybrid beamformer setup with medium resolution converters without significant performance loss and at lower power consumption and overall hardware cost.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Quantization Limitations of Leakage Suppression in Self-Calibrating Monostatic Integrated Sensing and Communication MIMO Systems</title>
  <link>https://arxiv.org/abs/2606.11665</link>
  <pubDate>Thu, 11 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.11665v1 Announce Type: new Abstract: Power leaking directly from transmitting into receiving radio-frequency chains is a key challenge in the realization of monostatic sensing applications with multi-antenna communication front-ends, to which a promising solution is digitally precoding transmitted signals for improved leakage suppression. While digital transmit precodings perform well in theory, real-world deployments typically exhibit severely degraded leakage suppression. This work investigates quantization noise as a primary factor limiting the performance of such precoding schemes. A closed-form solution predicting the impact of quantization noise on the performance of arbitrary digital joint leakage estimation and leakage suppression precodings is derived, numerically analyzed, and validated in a hardware testbed.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Measurement-Based Analysis of Outdoor Massive MIMO Channel Characteristics over FR3 Frequency Band</title>
  <link>https://arxiv.org/abs/2606.11622</link>
  <pubDate>Thu, 11 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.11622v1 Announce Type: new Abstract: The Frequency Range 3 (FR3) band is attracting increasing attention due to limited lower-frequency spectrum and growing mobile communication demand. This study experimentally investigates channel characteristics in Urban Macro (UMa) scenarios at 8 GHz and 15 GHz using a large-scale MIMO platform with time-division multiplexing (TDM). Key parameters, including root mean square (RMS) delay spread (DS) and angular spread (AS), were extracted and compared with 3rd Generation Partnership Project (3GPP) TR 38.901. Results reveal clear frequency-dependent behaviors: RMS delay spread remains nearly constant under line of sight (LOS) but decreases from 8 GHz to 15 GHz in non-line of sight (NLOS), indicating reduced multipath dispersion at higher frequencies. Both azimuthal spreads (including ASA and ASD) and elevation spreads (including ESA and ESD) exhibit a corresponding decrease with increasing frequency, demonstrating a consistent trend towards more directional propagation across all angular domains. Capacity analysis indicates that the 15 GHz channel slightly outperforms 8 GHz in both LOS and NLOS scenarios due to more concentrated multipath energy and larger dominant singular values. Higher frequencies exhibit greater directionality, whereas lower frequencies provide broader multipath distributions and more stable performance, offering valuable guidance for multi-band MIMO modeling and 6G system design.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Antenna Coding and Digital Precoding for Limited Feedback MIMO Systems Using Pixel Antennas</title>
  <link>https://arxiv.org/abs/2606.11588</link>
  <pubDate>Thu, 11 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.11588v1 Announce Type: new Abstract: Pixel antennas enable antenna coding, a technique that can provide more degrees of freedom in wave manipulation, to enhance wireless communications. However, acquiring full channel state information (CSI) at the transmitter incurs prohibitive overhead due to the unique hardware constraints from pixel antennas. This paper thus proposes a limited feedback multi-input multi-output (MIMO) system using pixel antennas, where the antenna coder and digital precoder are designed based on pre-defined codebooks and efficient index feedbacks. We first derive the optimal digital precoder under practical power constraints that provides insights on simplifying the joint codebook construction for antenna coder and digital precoder. We then develop a low-complexity offline codebook construction algorithm that enables subsequent codebook designs for the antenna coder and digital precoder. Simulation results demonstrate that the proposed scheme significantly outperforms unconstrained MIMO systems using conventional antennas with fixed configurations.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Coherent Multiband OFDM Sensing via Low-Complexity Gap Reconstruction</title>
  <link>https://arxiv.org/abs/2606.11449</link>
  <pubDate>Thu, 11 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.11449v1 Announce Type: new Abstract: This paper investigates coherent multiband orthogonal frequency division multiplexing (OFDM) sensing within an integrated sensing and communication (ISAC) framework. We consider an intra-band configuration in which two sensing subbands of equal width are allocated symmetrically within the same OFDM channel, while the central portion remains available for communication. We address the reconstruction of missing frequency-domain samples induced by the spectral gap and the suppression of the resulting grating lobes in the delay profile. To this end, we propose a low-complexity iterative reconstruction method consisting of an initial delay-domain equalization stage and an iterative apodization-based operator with data-consistency enforcement. Performance results for multi-target scenarios show that the proposed approach remains close to the full-band reference for moderate gap sizes and degrades only for larger gaps because of residual grating lobes. Compared with the compressed-sensing-based orthogonal matching pursuit (OMP) baseline, it exhibits a more favorable performance trend as the number of targets increases, especially in the practically relevant low-signal-to-noise ratio (SNR) regime, while offering a complexity scaling that is independent of the estimated number of targets.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Additive Noise, Shift Recovery, and Signed Signals in the Cumulative Distribution Transform</title>
  <link>https://arxiv.org/abs/2606.11432</link>
  <pubDate>Thu, 11 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.11432v1 Announce Type: new Abstract: The cumulative distribution transform (CDT) is a quantile-based transport representation that exactly linearizes one-dimensional translations of positive densities. We study how this structure behaves under additive perturbations and how it can be exploited for shift recovery. Under a local nondegeneracy condition, we derive a first-order expansion showing that additive noise in physical space induces a nonlocal perturbation in CDT space through the primitive of the noise, weighted by the reciprocal density. This yields an explicit description of transform-domain sensitivity and shows, in particular, that perturbations are amplified in low-density regions. When the physical-space perturbation is modeled as a centered Gaussian random field, the induced first-order CDT perturbation is again Gaussian, with an explicit covariance kernel. We then use this structure to study recovery in CDT coordinates. In the known-template setting, the transport shift is obtained by projection onto the constant mode, giving an explicit estimator together with exactness in the noiseless case and a stability bound under perturbations. In the unknown-template setting, multiple observations permit joint recovery of the shifts and a common template up to the natural constant-mode gauge, leading to a simple de-shift--and--average procedure. We also consider a signed-signal analogue based on the signed cumulative distribution transform (SCDT), where shifts are estimated numerically by feature matching and unknown templates are recovered by alternating alignment and averaging. Numerical experiments validate the perturbation analysis and illustrate effective recovery for both density-valued and signed signals.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Beamforming Gain with Single-RF Movable Arrays</title>
  <link>https://arxiv.org/abs/2606.11342</link>
  <pubDate>Thu, 11 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.11342v1 Announce Type: new Abstract: A single-radio-frequency (RF) movable array is investigated, in which all movable elements are driven by a single RF chain with equal amplitude and equal phase. The achievable beamforming gain enabled by antenna placement is analyzed. Linear beamforming gain scaling with the number of antennas is shown to be achievable in single-path channels, while coherent-combining conditions and aperture requirements are established for multipath channels. For multiuser transmission, the optimal max-min power allocation is derived in closed form, based on which an element-wise coordinate-search algorithm is developed for antenna placement design. Numerical results validate the analysis and reveal a fundamental tradeoff: beamforming gains can be achieved through antenna placement alone, but only at the expense of increased aperture resources.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>OCSVM-Guided Representation Learning for Unsupervised Anomaly Detection</title>
  <link>https://arxiv.org/abs/2507.21164</link>
  <pubDate>Thu, 11 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2507.21164v2 Announce Type: replace-cross Abstract: Unsupervised anomaly detection (UAD) aims to detect anomalies without labeled data, a necessity in many machine learning applications where anomalous samples are rare or not available. Most state-of-the-art methods fall into two categories: reconstruction-based approaches, which often reconstruct anomalies too well, and decoupled representation learning with density estimators, which can suffer from suboptimal feature spaces. While some recent methods attempt to couple feature learning and anomaly detection, they often rely on surrogate objectives, restrict kernel choices, or introduce approximations that limit their expressiveness and robustness. To address this challenge, we propose a novel method that couples representation learning with an analytically solvable One-Class SVM (OCSVM), through a custom loss formulation that directly aligns latent features with the OCSVM decision boundary. The model is evaluated on two tasks: a benchmark based on MNIST-C, and a challenging brain MRI lesion detection task. Unlike most methods that focus on large, hyperintense lesions at the image level, our approach succeeds in targeting small, non-hyperintense lesions, while we evaluate voxel-wise metrics, addressing a more clinically relevant scenario. Both experiments evaluate a form of robustness to domain shifts, including corruption types in MNIST-C and texture or population age variations in MRI. Results demonstrate the performance and robustness of our proposed model, highlighting its potential for general UAD and real-world medical imaging applications. The source code is available at https://github.com/Nicolas-Pinon/uad_ocsvm_guided_repr_learning.</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>Towards Deep Learning Surrogate for the Forward Problem in Electrocardiology: A Scalable Alternative to Physics-Based Models</title>
  <link>https://arxiv.org/abs/2512.13765</link>
  <pubDate>Thu, 11 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2512.13765v2 Announce Type: replace Abstract: The forward problem in electrocardiology, computing body surface potentials from cardiac electrical activity, is traditionally solved using physics-based models such as the bidomain or monodomain equations. While accurate, these approaches are computationally expensive, limiting their use in real-time and large-scale clinical applications. We propose a proof-of-concept deep learning (DL) framework as an efficient surrogate for forward solvers. The model adopts a time-dependent, attention-based sequence-to-sequence architecture to predict electrocardiogram (ECG) signals from cardiac voltage propagation maps. A hybrid loss combining Huber loss with a spectral entropy term was introduced to preserve both temporal and frequency-domain fidelity. Using 2D tissue simulations incorporating healthy, fibrotic, and gap junction-remodelled conditions, the model achieved high accuracy (mean $R^2 = 0.99 \pm 0.01$). Ablation studies confirmed the contributions of convolutional encoders, time-aware attention, and spectral entropy loss. These findings highlight DL as a scalable, cost-effective alternative to physics-based solvers, with potential for clinical and digital twin applications.</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>Bridging the Modality Gap in Forensic Image Retrieval</title>
  <link>https://arxiv.org/abs/2606.12294</link>
  <pubDate>Thu, 11 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.12294v1 Announce Type: cross Abstract: Automated image retrieval plays an increasingly critical role in modern forensic analysis, supporting investigative workflows that rely on efficient comparison of visual evidence. While prior work has focused primarily on developing and optimizing multimodal retrieval systems, limited attention has been paid to evaluating the forensic applicability of these technologies across diverse real-world scenarios. In this study, we present a unified retrieval framework adapted to four key forensic tasks: (1) tattoo image retrieval given a tattoo query image; (2) tattoo retrieval guided by human-expert textual descriptions, modelling the common situation where a witness verbally describes a tattoo; (3) tattoo retrieval from hand-drawn sketches; and (4) face retrieval from forensic face sketches. Our system leverages a multimodal large language model (MLLM) to automatically generate structured textual descriptions for all queries and gallery images, followed by sentence-transformer embedding for text-based comparison. We evaluate retrieval using visual-only embeddings, text-only embeddings and a multimodal fusion strategy that combines text- and image-based similarity scores derived from state-of-the-art visual feature extractors relevant to each task. The fusion of modalities consistently improves retrieval precision and robustness, especially in scenarios where visual information is limited or noisy (e.g., sketches, partial tattoos, or fragmented witness statements). This work highlights the forensic value of a unified multimodal retrieval pipeline and demonstrates how modern MLLMs can operationalize challenging forensic tasks that traditionally rely on manual expert analysis. Our results position multimodal retrieval as a promising tool for supporting investigative workflows involving tattoos, facial composites, and witness descriptions.</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>An Electric Potential-Augmented Benchmark Dataset for Physics-Guided Image Reconstruction of Electrical Capacitance Tomography</title>
  <link>https://arxiv.org/abs/2606.12226</link>
  <pubDate>Thu, 11 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.12226v1 Announce Type: cross Abstract: While deep learning has significantly advanced image reconstruction of Electrical Capacitance Tomography (ECT), most data-driven methods map directly between capacitance and permittivity distribution, treating the sensor as a black box. This overlooks the electric potential field -- the fundamental physical link governing the nonlinear and ill-posed &quot;soft-field&quot; effect. To address this, we propose an electric potential-augmented ECT benchmark dataset designed to explicitly integrate latent physics behind ECT into the learning process. Generated via a COMSOL-MATLAB pipeline for an eight-electrode sensor as an example, the dataset comprises 20,000 randomized samples across four typical flow patterns. Crucially, alongside the conventional capacitance vectors and permittivity distributions depicted as images, each sample preserves eight excitation-wise full-field potential maps. Beyond data release, we provide illustrative evaluation protocols for both forward and inverse problems of ECT. Through comprehensive testing on both in-distribution (IID) and out-of-distribution (OOD) scenarios, we systematically demonstrate how the inclusion of electric potential maps enhances modeling accuracy and robustness. Fundamentally, the explicit inclusion of latent field information significantly lowers the barrier to integrating physical laws into ECT modeling, thereby establishing a standardized foundation for future physics-guided machine learning of ECT image reconstruction.</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>Non-frontal face recognition using GANs and memristor-based classifiers</title>
  <link>https://arxiv.org/abs/2606.12074</link>
  <pubDate>Thu, 11 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.12074v1 Announce Type: cross Abstract: Face recognition systems have advanced significantly through deep learning techniques, delivering high performance and robustness in complex scenarios. However, these approaches incur substantial computational overhead, limiting their in situ applicability in resource-constrained platforms such as drones, where they can address challenges including non-frontal facial imagery. Memristor-based neuromorphic systems have emerged as a compelling approach for edge AI applications, combining biologically inspired processing with efficient and scalable computation. In this work, we propose a facial recognition framework that addresses non-frontal pose variations by integrating lightweight generative adversarial network (GAN)-based pose frontalisation with memristor-based neuromorphic recognition. The experimental results on two datasets demonstrate the effectiveness of combining adversarial learning with memristive technology, achieving up to 96% identification accuracy. The proposed approach alleviates the computational bottlenecks of conventional AI and offers a scalable, efficient solution for face recognition in dynamic real-world environments.</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>An Indoor Localization Technique Utilizing Passive Tags and 3-D Microwave Passive Radar Imaging</title>
  <link>https://arxiv.org/abs/2606.12123</link>
  <pubDate>Thu, 11 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.12123v1 Announce Type: new Abstract: A privacy-compliant indoor localization approach utilizing a 3-D near-field (NF) passive radar imaging technique is presented. This technique leverages ubiquitously radiated electromagnetic fields for imaging, with passive tags introduced to enhance the strength of scattering fields, thereby enabling precise localization at the imaging level. The method also supports localization in non-ideal imaging scenarios, such as limited bandwidth or highly reflective environments. Based on their geometrical properties, the simple and low-cost passive tags enable intuitive differentiation between individuals or objects. Associated privacy protection mechanisms are discussed, where the frequency-varying properties of the passive tags provide additional flexibility and potential applications under privacy and ethical considerations. Several forms of passive tags are presented, where both simulation and experimental results validate the effectiveness of the proposed passive tag designs.</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>FlexiBrain: Resolution-Agnostic Voxel-Level Encoding for Native fMRI</title>
  <link>https://arxiv.org/abs/2606.11500</link>
  <pubDate>Thu, 11 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.11500v1 Announce Type: new Abstract: The success of large-scale deep learning models in neuroscience is fundamentally constrained by severe data heterogeneity. Native fMRI data aggregated from diverse sources exhibit substantial variation in both spatial and temporal resolutions. Consequently, most existing frameworks rely on lengthy, rigid preprocessing pipelines that enforce uniformity across datasets. This practice introduces two critical limitations: (1) potential degradation of subject-specific anatomical information; (2) significant computational overhead, often requiring hours of processing per subject. Here, we propose FlexiBrain, a resolution-agnostic voxel-level encoding framework for native fMRI based on Mamba-JEPA. FlexiBrain defines patch sizes in real-world physical units and employs dynamic patch resizing, thereby bypassing destructive spatial standardization while enabling direct ingestion of data in native space. We instantiate the framework using an efficient Mamba-JEPA backbone to model high-dimensional 4D fMRI signals. Across five diverse downstream neuroscience tasks, FlexiBrain consistently outperforms recent state-of-the-art methods, achieving gains of up to 12 percentage points without external data augmentation. Importantly, FlexiBrain functions as a seamless plug-in module, substantially reducing preprocessing costs and accelerating the development of robust voxel-level fMRI foundation models. Code is available at https://github.com/OneMore1/FlexiBrain.</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>The Dynamics of Human and AI-Generated Language: How Semantics Fluctuates across Different Timescales</title>
  <link>https://arxiv.org/abs/2606.11371</link>
  <pubDate>Thu, 11 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.11371v1 Announce Type: cross Abstract: Spoken language, whether produced by humans or large language models (LLM), unfolds over time with varying semantic content. However, we still lack simple, interpretable time-series features that capture how generic versus specific content is distributed over time, and that can be used to compare human and AI-generated speech. We introduce a semantic-timescale analysis pipeline that turns word-level transcripts with timestamps into semantic time-series. For each spoken narrative, we compute (i) semantic specificity using WordNet-based word depth and (ii) contextual similarity using SBERT embeddings and quantify their temporal dependence using autocorrelation-window measures (ACW-0 and related metrics). We then compare original speech to multiple shuffled controls that selectively disrupt lexical identity, temporal order, and word duration. Across human-read autobiographical narratives, TTS readings, and LLM-generated texts rendered with TTS, we find that segments with longer ACW-0 in the semantic time-series tend to contain more generic vocabulary, whereas segments with shorter ACW-0 are enriched in more specific words. These associations are strongly attenuated or abolished when word order and timing are randomized, indicating that ACW-based measures capture non-trivial temporal organization of semantic content beyond static lexical distributions. Our results suggest that ACW-based semantic timescales are a useful family of features for analyzing and comparing the temporal structure of human and AI-generated speech.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Designed-Source Reductions and a Dual-Purpose Feasibility Band for Semantic Rate-Distortion</title>
  <link>https://arxiv.org/abs/2606.11280</link>
  <pubDate>Thu, 11 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.11280v1 Announce Type: cross Abstract: The joint rate-distortion framework of Stavrou and Kountouris (IEEE Transactions on Communications 2023) characterises dual-fidelity tradeoffs for semantic communication on stochastic semantic sources. Many task-oriented communication systems instead use designed sources, where the semantic object is a deterministic oracle allocation $\phi^*(t)$ rather than a stochastic quantity given by nature. We isolate the subclass of designed sources under smooth concave utility with assumptions A1, A2 and Euclidean allocation codomain, and restrict the encoder class to deterministic common-category mappings. Within this subclass the SK exponential-tilting decoder and generalised Blahut--Arimoto iteration specialise to conditional-mean decoding and Lloyd--Max stationarity on $\phi^*(t)$. When the second fidelity is a monotone single-letter distortion, the joint problem stays inside the SK admissible class; the common-category SK rate is lower-bounded by the max of the corresponding Shannon rate-distortion functions, with equality only when the common-category reconstruction is compatible and RDF-optimal. When the second fidelity is aggregate verification, the joint problem leaves the SK single-letter class and admits a constrained-design feasibility band $R_{\min}(\varepsilon^*) \leq R \leq R_{\max}(\beta^*)$ of width $\log_2(K_{\max}/K_{\min})$ bits in partition cardinality. The reduction and the band are scope statements on the SK apparatus, not modifications to it. A smart-grid economic-dispatch example with a non-technical-loss-detection contrast illustrates the band.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Bending the Rules of Propagation: Caustic Beamforming for Next-Generation Wireless Systems</title>
  <link>https://arxiv.org/abs/2606.12321</link>
  <pubDate>Thu, 11 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.12321v1 Announce Type: new Abstract: Conventional beamforming techniques primarily steer energy along desired directions or focus it at specific locations. These techniques become fragile when facing frequent blockage and highly dynamic propagation environments. In this article, we present caustic beamforming as a new paradigm for wireless beam control. First, we classify representative caustic beams according to their underlying mathematical origins and present three unique properties, namely self-bending, self-healing, and near-field non-diffracting. Building on these propagation properties, we then propose several application scenarios in sixth-generation (6G) networks. We undertake two case studies focused on physical layer security and service stability that highlight the capability of caustic beams to bypass potential eavesdroppers, deliver more uniform coverage, and sustain blockage-resilient links. We further discuss the enabling hardware architectures that facilitate practical deployments, and finally outline key open challenges regarding caustic beams that require further research.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Near Field Multi-Band Localization: CRB, Efficient Estimator, and Threshold SNR</title>
  <link>https://arxiv.org/abs/2606.12314</link>
  <pubDate>Thu, 11 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.12314v1 Announce Type: new Abstract: This paper presents a theoretical framework for multi-band localization for a single-path single-input multiple-output (SIMO) system. We derive closed-form Cramer-Rao bounds (CRBs) for angle-of-arrival (AoA) and distance for uniform linear arrays (ULAs), and an intermediate matrix-form formulation for arbitrary array shapes. We also develop benchmark single- and multi-band maximum-likelihood (ML) estimators for AoA-Distance, leveraging a structured Levenberg-Marquardt (LM) refinement procedure. A key contribution is an analytical characterization of the threshold SNR (TSNR) for the proposed estimators. This is the SNR threshold at which the estimator transitions from &quot;off the chart&quot; to CRB-approaching performance, for both AoA and distance estimation. Numerical simulations confirm that the proposed single- and multi-band estimators achieve the CRB at SNRs above the predicted TSNR, and that multi-band processing simultaneously improves estimation accuracy and reduces SNR requirements. The resulting framework provides a rigorous foundation for next-generation multi-band localization and can be readily extended to elevation estimation, distributed arrays, and multi-path environments.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>LLM-Based Digital Twin Intelligence for Application-Aware Network Selection in 6G Heterogeneous Wireless Networks</title>
  <link>https://arxiv.org/abs/2606.12293</link>
  <pubDate>Thu, 11 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.12293v1 Announce Type: new Abstract: Future 6G heterogeneous wireless networks (HWNs) are expected to support multiple radio access technologies (RATs), dynamic wireless environments, and applications with diverse quality-of-service (QoS) requirements. In such environments, network selection (NS) cannot rely only on instantaneous radio measurements or static ranking rules. Instead, access decisions must account for the evolving wireless state, service intent, packet-level QoS behavior, and candidate-RAT dynamics. This paper proposes a large language model (LLM)-based digital twin (DT) framework for stable, application-aware RAT selection under candidate-set evolution. The main idea is to shift NS from an instantaneous decision-matrix operation to a decision process over an evolving wireless DT state. The constructed DT combines site-specific geometry, Sionna RT-based propagation descriptors, ns-3 packet-level QoS emulation, service context, candidate-RAT information, and decision memory. Rather than acting as a general-purpose controller for 6G networks, the LLM is used for DT-grounded decision intelligence in this specific NS task. On top of this DT, a unified intent agent translates user and service requirements into structured decision priorities for two complementary NS branches: an LLM-assisted multi-attribute decision-making branch (MADM--LLM--NS) and a direct LLM-based ranking branch (LLM--NS). To improve decision stability, the framework further introduces history-aware adaptive normalization (HAAN) and DT-memory-driven retrieval-augmented in-context learning (RA--ICL). Numerical results show that the proposed framework mitigates the rank-reversal problem and reduces unnecessary handover events, while improving service-aware QoS satisfaction compared with representative MADM-based NS baselines.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Characterization of Speech Imagery in Scalp EEG and Comparison with Motor Imagery</title>
  <link>https://arxiv.org/abs/2606.12223</link>
  <pubDate>Thu, 11 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.12223v1 Announce Type: new Abstract: Speech imagery is attractive as a brain-computer interface paradigm for communication because it is endogenous and intrinsically linguistic. Yet despite growing interest, its dominant scalp-EEG spatiotemporal characteristics remain poorly characterized. Here, we asked how speech imagery appears in scalp EEG and compared it against finger motor imagery. Using a within-subject dataset containing speech imagery, finger motor imagery, and no-task trials recorded under the same trial structure, we analyzed band-power dynamics across channels and time. Finger motor imagery showed the expected contralateral mu/alpha and low-beta desynchronization over sensorimotor areas, whereas speech imagery showed a weaker, more distributed alpha-dominant increase. After normalization to each condition&#39;s own post-trial interval, the speech-related alpha increase changed only modestly after cue onset, indicating that much of the speech-versus-no-task difference was already present during the instruction period. A classifier discriminating imagery from no-task reached mean balanced accuracies of 0.563 $\pm$ 0.072 for speech imagery and 0.718 $\pm$ 0.127 for motor imagery, with a stronger alpha/beta dependence for motor imagery than for speech imagery. Together, these results provide a clearer group-level characterization of speech imagery in scalp EEG and indicate that its dominant spatiotemporal pattern differs from that of finger motor imagery and is more consistent with substantial non-articulatory task-related contributions than with a clear articulatory-motor analogue.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Deep Reinforcement Learning for Adaptive Power Allocation in ISAC Systems with Mobile Target</title>
  <link>https://arxiv.org/abs/2606.12078</link>
  <pubDate>Thu, 11 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.12078v1 Announce Type: new Abstract: In this paper, we study the power allocation for an integrated sensing and communication (ISAC) system which tracks a mobile target. We first model the problem as a Markov decision process, and then tackle it with a soft actor-critic (SAC) based deep reinforcement learning (DRL) approach. We also incorporate a Dirichlet policy, which naturally produces normalized continuous actions under random target motion. To exploit different features of sensing and communication operations, we carefully design a reward function such that the system can dynamically control power allocation to conserve resources. The simulation results demonstrate that the proposed scheme enhances tracking performance compared to other baselines while sustaining communication performance.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Unlocking the Potential of Movable Antennas: General and Practical Antenna Position Optimization</title>
  <link>https://arxiv.org/abs/2606.12024</link>
  <pubDate>Thu, 11 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.12024v1 Announce Type: new Abstract: Recently, movable antenna (MA) has attracted wide attention in wireless communications due to its potential in enhancing wireless communication performance via local movement within a confined region. However, antenna position optimization (APO) has emerged as a major challenge for MAs, due to the lack of a tractable, analytical, and accurate channel model in terms of antenna positions. Although existing works have developed various algorithms for APO, most of them are based on simplified theoretical channel models, which limit their generality. To address this challenge, in this article, we present more general and effective APO algorithms for different purposes, categorized as continuous APO and discrete APO, respectively. Continuous APO is mainly applied for flexible array signal processing to boost large-scale communication performance, while discrete APO is applied for small-scale multi-path channel reshaping. Specifically, the discrete APO discretizes the antenna movement region into multiple sampling points and employs discrete algorithms to determine the optimal MA positions based on the point-wise channel state information (CSI), without the need for an analytical channel model. To reduce the overhead for CSI acquisition, we also present more efficient learning-based APO algorithms that operate without requiring full point-wise CSI. Finally, we compare the application scenarios of the proposed algorithms and validate their effectiveness with numerical results.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Low-Density EEG for Seizure Detection: Evaluating CNN-RNN Architectures on a Behind-the-Ear Montage Setup</title>
  <link>https://arxiv.org/abs/2606.11970</link>
  <pubDate>Thu, 11 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.11970v1 Announce Type: new Abstract: Epilepsy affects over 50 million individuals globally, underscoring the need for automated seizure detection systems that can alleviate clinicians&#39; workload and enhance the accuracy of patient seizure diaries. In wearable EEG applications, however, reliable detection remains challenging due to the limited spatial resolution of low-density electrode configurations, reduced signal-to-noise ratios, and the scarcity of diverse, publicly available training datasets. This study investigates the efficacy of hybrid deep learning architectures for automated seizure detection using a simulated behind-the-ear montage derived from the Temple University Seizure Corpus (TUSZ, v2.0.3). We conduct a systematic comparison of several CNN-RNN models, including LSTM- and GRU-based variants, across multiple EEG montages to evaluate their capacity to compensate for the loss of spatial information inherent to reduced electrode configurations. The proposed CNN-Merged model, which integrates temporal and spectral feature representations, demonstrates superior performance, achieving a ROC AUC of 85.89% and a balanced accuracy of 79.11% on the held-out test set. Furthermore, the model exhibits strong robustness across different reference montages, effectively bridging the performance gap between conventional full-scalp recordings and resource-constrained wearable systems. These findings substantiate the potential of hybrid deep learning models as a promising avenue toward robust, patient-independent seizure detection in low-density EEG applications.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>NARRAS: Edge-Triggered Distributed Inference for CSI-Based Localization in Vehicular IoT Networks</title>
  <link>https://arxiv.org/abs/2606.11914</link>
  <pubDate>Thu, 11 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.11914v1 Announce Type: new Abstract: CSI-based localization with spatially distributed antenna arrays exposes a basic resource trade-off. Each array can provide a rich view of the channel, but forwarding observations from all arrays to a fusion center is wasteful when only a few carry useful information, and the shared uplink supports only a limited number of simultaneous transmissions. We let each array decide locally whether its current observation is worth reporting, subject to a budget on the average number of active transmitters. We refer to this abstraction as Edge-Triggered Distributed Inference (ETDI). It captures a broader class of task-oriented communication problems where resource-constrained devices share an access channel for a common inference task. We instantiate ETDI for CSI-based localization, a common scenario in vehicular IoT networks. Spatially distributed remote antenna arrays (RAAs) encode local channel state information (CSI) from user equipment (UE) transmissions into latent features, and the fusion center estimates the UE position from the subset of reported features. We propose NARRAS, a decentralized reporting policy in which each RAA combines a recurrent summary of its recent observations with a memory of the last latent it transmitted. Training controls an explicit activity budget through differentiable activity penalties and validation-calibrated deterministic thresholds, and uses channel-chart regularization to shape the latent geometry. Experiments show that, at comparable uplink activity, NARRAS improves localization accuracy over learned and heuristic sparse-reporting strategies, while dense full-report models remain useful budget-free references. In low-activity regimes, chart regularization further reduces high-percentile localization errors, suggesting that geometry-aware latent representations are more robust under sparse reporting.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Efficiency Meets Reliability: Enhanced Generalized Interleaved Transform for Random Multiplexing</title>
  <link>https://arxiv.org/abs/2606.11890</link>
  <pubDate>Thu, 11 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.11890v1 Announce Type: new Abstract: To meet the demands of 6G wireless systems operating in high-mobility scenarios, this paper presents a design of a random multiplexing (RM) communication system that is both storage-efficient and highly reliable. In principle, RM with cross-domain memory approximate message passing (CD-MAMP) can achieve replica maximum a posteriori (MAP)-optimal performance by constructing a fully dense equivalent channel matrix. However, its practical implementation is hindered by the large storage overhead of conventional interleavers and by performance degradation in severely ill-conditioned channels, which existing related work (focusing on interleaving and transform designs) fails to address simultaneously. To overcome these issues, we develop a storage-efficient and highly reliable system that integrates RM with CD-MAMP, referred to as RM-MAMP. Specifically, we propose a Logistic chaotic mapping interleaver with a quantitative parameter-selection criterion, and a dual-stage high-order permutation polynomial interleaver, both of which achieve nearly identical bit-error-rate (BER) as fully random interleavers while reducing the interleaver storage from O(N) to O(1) and significantly lowering interleaver signaling overhead. We further propose a highly reliable interleaved transform framework, comprising an interleaved phase perturbation transform and a multi-layer interleaved coupled transform, to enhance the incoherence and diversity of the equivalent channel matrix. Simulation results show that the proposed storage-efficient interleavers maintain BER performance comparable to fully random interleavers, while the highly reliable transforms provide over 4 dB gain in severely time-varying channels, confirming the dual benefits of reduced storage overhead and improved robustness for the enhanced RM-MAMP system.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>On the Robustness of AFBM Sensing to Power Amplifier Nonlinearities</title>
  <link>https://arxiv.org/abs/2606.11879</link>
  <pubDate>Thu, 11 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.11879v1 Announce Type: new Abstract: We investigate the impact of power amplifier (PA) nonlinearities on the sensing performance of affine filter bank modulation (AFBM). While AFBM offers several advantageous properties for integrated sensing and communications (ISAC) - including reduced out-of-band emission (OOBE), low peak-to-average power ratio (PAPR), and natural robustness to doubly-dispersive (DD) channel effects - mitigating waveform distortion typically requires highly linear PAs. This creates a fundamental contradiction with ISAC applications, which demand high transmit power for reliable sensing. Our analytical results reveal that the structure of the effective AFBM modulation matrix dictates how distortion propagates within the ambiguity function (AF). Furthermore, simulations demonstrate that both the AF and the overall sensing performance of AFBM remain remarkably insensitive to such nonlinearities. These findings highlight the robustness of AFBM, making it a highly viable candidate for practical ISAC deployments constrained by hardware impairments.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>REACH: Interpretability-Driven Feature Identification and Architecture Compression for Multi-Channel Vehicular Channel Estimation</title>
  <link>https://arxiv.org/abs/2606.11857</link>
  <pubDate>Thu, 11 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.11857v1 Announce Type: new Abstract: Multi-channel mixed-SNR training improves out-of-distribution (OOD) generalisation of deep learning channel estimators for IEEE 802.11p vehicular communications, yet the internal mechanism responsible for this remains unexplained. This work presents REACH (Relevance-based Explanation and Architectural Compression for cHannel estimators), a gradient-based interpretability framework that operates at two levels. Input-level attribution identifies a subset of time-frequency features consistently relevant across all evaluated channel conditions, enabling input dimensionality reduction with minimal performance loss. Filter-level attribution reveals a near-universal internal representation, providing a representational account of the observed OOD generalisation. Guided by the resulting filter taxonomy, relevance-guided architecture compression substantially reduces both the number of parameters and the number of floating-point operations (FLOPs) with sub-1 dB normalised mean square error (NMSE) degradation, and OOD generalisation degrades more slowly than within-distribution accuracy under increasing compression.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Parametric Channel Estimation with Hardware Impaired Hybrid Beamformers: Sensing, Communications, and Power Efficiency Tradeoffs</title>
  <link>https://arxiv.org/abs/2606.11829</link>
  <pubDate>Thu, 11 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.11829v1 Announce Type: new Abstract: Due to high power consumption and hardware costs of fully digital arrays, hybrid beamformers are often considered as a more economic alternative. Furthermore, using high resolution analog to digital converters (ADCs) can also have prohibitive power consumption, which leads to lower resolution converters being considered for radio frequency (RF) front end design. The finite quantization resolution as well as the nonlinearities caused by the power amplifiers (PAs) and low noise amplifiers (LNAs) can have a substantial impact on system performance. While widely studied for communications, the impact of hardware impairments on sensing performance is considerably less explored. In this work, we study the interplay between hybrid beamforming architectures, hardware impairments, and sensing and communications performance. Additionally, we define the concept of double-isotropy for pilot-combiner pairs, formalizing the notion of a perfectly energy-fair beam sweep. The multiple start (MS) space alternating generalized expectation maximization algorithm (SAGE) is also introduced, aimed at addressing the optimization issues arising from parametric channel estimation (PCE) in hybrid beamformed systems. We then provide a set of numerical results assessing the impacts of beamformer architecture and ADC resolution on PCE, sensing, and communications performance. The results show that medium resolution ADCs lead to the most power efficient configurations, with the best tradeoff between power consumption and performance for the majority of beamforming architectures. Additionally, fully digital beamforming architectures with high resolution converters can often be replaced by a hybrid beamformer setup with medium resolution converters without significant performance loss, at lower power consumption and overall hardware cost.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Quantization Limitations of Leakage Suppression in Self-Calibrating Monostatic Integrated Sensing and Communication MIMO Systems</title>
  <link>https://arxiv.org/abs/2606.11665</link>
  <pubDate>Thu, 11 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.11665v1 Announce Type: new Abstract: Power leaking directly from transmitting into receiving radio-frequency chains is a key challenge in the realization of monostatic sensing applications with multi-antenna communication front-ends, to which a promising solution is digitally precoding transmitted signals for improved leakage suppression. While digital transmit precodings perform well in theory, real-world deployments typically exhibit severely degraded leakage suppression. This work investigates quantization noise as a primary factor limiting the performance of such precoding schemes. A closed-form solution predicting the impact of quantization noise on the performance of arbitrary digital joint leakage estimation and leakage suppression precodings is derived, numerically analyzed, and validated in a hardware testbed.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Measurement-Based Analysis of Outdoor Massive MIMO Channel Characteristics over FR3 Frequency Band</title>
  <link>https://arxiv.org/abs/2606.11622</link>
  <pubDate>Thu, 11 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.11622v1 Announce Type: new Abstract: The Frequency Range 3 (FR3) band is attracting increasing attention due to limited lower-frequency spectrum and growing mobile communication demand. This study experimentally investigates channel characteristics in Urban Macro (UMa) scenarios at 8 GHz and 15 GHz using a large-scale MIMO platform with time-division multiplexing (TDM). Key parameters, including root mean square (RMS) delay spread (DS) and angular spread (AS), were extracted and compared with 3rd Generation Partnership Project (3GPP) TR 38.901. Results reveal clear frequency-dependent behaviors: RMS delay spread remains nearly constant under line of sight (LOS) but decreases from 8 GHz to 15 GHz in non-line of sight (NLOS), indicating reduced multipath dispersion at higher frequencies. Both azimuthal spreads (including ASA and ASD) and elevation spreads (including ESA and ESD) exhibit a corresponding decrease with increasing frequency, demonstrating a consistent trend towards more directional propagation across all angular domains. Capacity analysis indicates that the 15 GHz channel slightly outperforms 8 GHz in both LOS and NLOS scenarios due to more concentrated multipath energy and larger dominant singular values. Higher frequencies exhibit greater directionality, whereas lower frequencies provide broader multipath distributions and more stable performance, offering valuable guidance for multi-band MIMO modeling and 6G system design.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Antenna Coding and Digital Precoding for Limited Feedback MIMO Systems Using Pixel Antennas</title>
  <link>https://arxiv.org/abs/2606.11588</link>
  <pubDate>Thu, 11 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.11588v1 Announce Type: new Abstract: Pixel antennas enable antenna coding, a technique that can provide more degrees of freedom in wave manipulation, to enhance wireless communications. However, acquiring full channel state information (CSI) at the transmitter incurs prohibitive overhead due to the unique hardware constraints from pixel antennas. This paper thus proposes a limited feedback multi-input multi-output (MIMO) system using pixel antennas, where the antenna coder and digital precoder are designed based on pre-defined codebooks and efficient index feedbacks. We first derive the optimal digital precoder under practical power constraints that provides insights on simplifying the joint codebook construction for antenna coder and digital precoder. We then develop a low-complexity offline codebook construction algorithm that enables subsequent codebook designs for the antenna coder and digital precoder. Simulation results demonstrate that the proposed scheme significantly outperforms unconstrained MIMO systems using conventional antennas with fixed configurations.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Coherent Multiband OFDM Sensing via Low-Complexity Gap Reconstruction</title>
  <link>https://arxiv.org/abs/2606.11449</link>
  <pubDate>Thu, 11 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.11449v1 Announce Type: new Abstract: This paper investigates coherent multiband orthogonal frequency division multiplexing (OFDM) sensing within an integrated sensing and communication (ISAC) framework. We consider an intra-band configuration in which two sensing subbands of equal width are allocated symmetrically within the same OFDM channel, while the central portion remains available for communication. We address the reconstruction of missing frequency-domain samples induced by the spectral gap and the suppression of the resulting grating lobes in the delay profile. To this end, we propose a low-complexity iterative reconstruction method consisting of an initial delay-domain equalization stage and an iterative apodization-based operator with data-consistency enforcement. Performance results for multi-target scenarios show that the proposed approach remains close to the full-band reference for moderate gap sizes and degrades only for larger gaps because of residual grating lobes. Compared with the compressed-sensing-based orthogonal matching pursuit (OMP) baseline, it exhibits a more favorable performance trend as the number of targets increases, especially in the practically relevant low-signal-to-noise ratio (SNR) regime, while offering a complexity scaling that is independent of the estimated number of targets.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Additive Noise, Shift Recovery, and Signed Signals in the Cumulative Distribution Transform</title>
  <link>https://arxiv.org/abs/2606.11432</link>
  <pubDate>Thu, 11 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.11432v1 Announce Type: new Abstract: The cumulative distribution transform (CDT) is a quantile-based transport representation that exactly linearizes one-dimensional translations of positive densities. We study how this structure behaves under additive perturbations and how it can be exploited for shift recovery. Under a local nondegeneracy condition, we derive a first-order expansion showing that additive noise in physical space induces a nonlocal perturbation in CDT space through the primitive of the noise, weighted by the reciprocal density. This yields an explicit description of transform-domain sensitivity and shows, in particular, that perturbations are amplified in low-density regions. When the physical-space perturbation is modeled as a centered Gaussian random field, the induced first-order CDT perturbation is again Gaussian, with an explicit covariance kernel. We then use this structure to study recovery in CDT coordinates. In the known-template setting, the transport shift is obtained by projection onto the constant mode, giving an explicit estimator together with exactness in the noiseless case and a stability bound under perturbations. In the unknown-template setting, multiple observations permit joint recovery of the shifts and a common template up to the natural constant-mode gauge, leading to a simple de-shift--and--average procedure. We also consider a signed-signal analogue based on the signed cumulative distribution transform (SCDT), where shifts are estimated numerically by feature matching and unknown templates are recovered by alternating alignment and averaging. Numerical experiments validate the perturbation analysis and illustrate effective recovery for both density-valued and signed signals.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Beamforming Gain with Single-RF Movable Arrays</title>
  <link>https://arxiv.org/abs/2606.11342</link>
  <pubDate>Thu, 11 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.11342v1 Announce Type: new Abstract: A single-radio-frequency (RF) movable array is investigated, in which all movable elements are driven by a single RF chain with equal amplitude and equal phase. The achievable beamforming gain enabled by antenna placement is analyzed. Linear beamforming gain scaling with the number of antennas is shown to be achievable in single-path channels, while coherent-combining conditions and aperture requirements are established for multipath channels. For multiuser transmission, the optimal max-min power allocation is derived in closed form, based on which an element-wise coordinate-search algorithm is developed for antenna placement design. Numerical results validate the analysis and reveal a fundamental tradeoff: beamforming gains can be achieved through antenna placement alone, but only at the expense of increased aperture resources.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>OCSVM-Guided Representation Learning for Unsupervised Anomaly Detection</title>
  <link>https://arxiv.org/abs/2507.21164</link>
  <pubDate>Thu, 11 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2507.21164v2 Announce Type: replace-cross Abstract: Unsupervised anomaly detection (UAD) aims to detect anomalies without labeled data, a necessity in many machine learning applications where anomalous samples are rare or not available. Most state-of-the-art methods fall into two categories: reconstruction-based approaches, which often reconstruct anomalies too well, and decoupled representation learning with density estimators, which can suffer from suboptimal feature spaces. While some recent methods attempt to couple feature learning and anomaly detection, they often rely on surrogate objectives, restrict kernel choices, or introduce approximations that limit their expressiveness and robustness. To address this challenge, we propose a novel method that couples representation learning with an analytically solvable One-Class SVM (OCSVM), through a custom loss formulation that directly aligns latent features with the OCSVM decision boundary. The model is evaluated on two tasks: a benchmark based on MNIST-C, and a challenging brain MRI lesion detection task. Unlike most methods that focus on large, hyperintense lesions at the image level, our approach succeeds in targeting small, non-hyperintense lesions, while we evaluate voxel-wise metrics, addressing a more clinically relevant scenario. Both experiments evaluate a form of robustness to domain shifts, including corruption types in MNIST-C and texture or population age variations in MRI. Results demonstrate performance and robustness of our proposed model, highlighting its potential for general UAD and real-world medical imaging applications. The source code is available at https://github.com/Nicolas-Pinon/uad_ocsvm_guided_repr_learning.</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>Towards Deep Learning Surrogate for the Forward Problem in Electrocardiology: A Scalable Alternative to Physics-Based Models</title>
  <link>https://arxiv.org/abs/2512.13765</link>
  <pubDate>Thu, 11 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2512.13765v2 Announce Type: replace Abstract: The forward problem in electrocardiology, computing body surface potentials from cardiac electrical activity, is traditionally solved using physics-based models such as the bidomain or monodomain equations. While accurate, these approaches are computationally expensive, limiting their use in real-time and large-scale clinical applications. We propose a proof-of-concept deep learning (DL) framework as an efficient surrogate for forward solvers. The model adopts a time-dependent, attention-based sequence-to-sequence architecture to predict electrocardiogram (ECG) signals from cardiac voltage propagation maps. A hybrid loss combining Huber loss with a spectral entropy term was introduced to preserve both temporal and frequency-domain fidelity. Using 2D tissue simulations incorporating healthy, fibrotic, and gap junction-remodelled conditions, the model achieved high accuracy (mean $R^2 = 0.99 \pm 0.01$). Ablation studies confirmed the contributions of convolutional encoders, time-aware attention, and spectral entropy loss. These findings highlight DL as a scalable, cost-effective alternative to physics-based solvers, with potential for clinical and digital twin applications.</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>Bridging the Modality Gap in Forensic Image Retrieval</title>
  <link>https://arxiv.org/abs/2606.12294</link>
  <pubDate>Thu, 11 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.12294v1 Announce Type: cross Abstract: Automated image retrieval plays an increasingly critical role in modern forensic analysis, supporting investigative workflows that rely on efficient comparison of visual evidence. While prior work has focused primarily on developing and optimizing multimodal retrieval systems, limited attention has been paid to evaluating the forensic applicability of these technologies across diverse real-world scenarios. In this study, we present a unified retrieval framework adapted to four key forensic tasks: (1) tattoo image retrieval given a tattoo query image; (2) tattoo retrieval guided by human-expert textual descriptions, modelling the common situation where a witness verbally describes a tattoo; (3) tattoo retrieval from hand-drawn sketches; and (4) face retrieval from forensic face sketches. Our system leverages a multimodal large language model (MLLM) to automatically generate structured textual descriptions for all queries and gallery images, followed by sentence-transformer embedding for text-based comparison. We evaluate retrieval using visual-only embeddings, text-only embeddings and a multimodal fusion strategy that combines text- and image-based similarity scores derived from state-of-the-art visual feature extractors relevant to each task. The fusion of modalities consistently improves retrieval precision and robustness, especially in scenarios where visual information is limited or noisy (e.g., sketches, partial tattoos, or fragmented witness statements). This work highlights the forensic value of a unified multimodal retrieval pipeline and demonstrates how modern MLLMs can operationalize challenging forensic tasks that traditionally rely on manual expert analysis. Our results position multimodal retrieval as a promising tool for supporting investigative workflows involving tattoos, facial composites, and witness descriptions.</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>An Electric Potential-Augmented Benchmark Dataset for Physics-Guided Image Reconstruction of Electrical Capacitance Tomography</title>
  <link>https://arxiv.org/abs/2606.12226</link>
  <pubDate>Thu, 11 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.12226v1 Announce Type: cross Abstract: While deep learning has significantly advanced image reconstruction of Electrical Capacitance Tomography (ECT), most data-driven methods map directly between capacitance and permittivity distribution, treating the sensor as a black box. This overlooks the electric potential field -- the fundamental physical link governing the nonlinear and ill-posed &quot;soft-field&quot; effect. To address this, we propose an electric potential-augmented ECT benchmark dataset designed to explicitly integrate latent physics behind ECT into the learning process. Generated via a COMSOL-MATLAB pipeline for an eight-electrode sensor as an example, the dataset comprises 20,000 randomized samples across four typical flow patterns. Crucially, alongside the conventional capacitance vectors and permittivity distributions depicted as images, each sample preserves eight excitation-wise full-field potential maps. Beyond data release, we provide illustrative evaluation protocols for both forward and inverse problems of ECT. Through comprehensive testing on both in-distribution (IID) and out-of-distribution (OOD) scenarios, we systematically demonstrate how the inclusion of electric potential maps enhances modeling accuracy and robustness. Fundamentally, the explicit inclusion of latent field information significantly lowers the barrier to integrating physical laws into ECT modeling, thereby establishing a standardized foundation for future physics-guided machine learning of ECT image reconstruction.</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>Non-frontal face recognition using GANs and memristor-based classifiers</title>
  <link>https://arxiv.org/abs/2606.12074</link>
  <pubDate>Thu, 11 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.12074v1 Announce Type: cross Abstract: Face recognition systems have advanced significantly through deep learning techniques, delivering high performance and robustness in complex scenarios. However, these approaches incur substantial computational overhead, limiting their in situ applicability in resource-constrained platforms such as drones, where they can address challenges including non-frontal facial imagery. Memristor-based neuromorphic systems have emerged as a compelling approach for edge AI applications, combining biologically inspired processing with efficient and scalable computation. In this work, we propose a facial recognition framework that addresses non-frontal pose variations by integrating lightweight generative adversarial network (GAN)-based pose frontalisation with memristor-based neuromorphic recognition. The experimental results on two datasets demonstrate the effectiveness of combining adversarial learning with memristive technology, achieving up to 96% identification accuracy. The proposed approach alleviates the computational bottlenecks of conventional AI and offers a scalable, efficient solution for face recognition in dynamic real-world environments.</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>An Indoor Localization Technique Utilizing Passive Tags and 3-D Microwave Passive Radar Imaging</title>
  <link>https://arxiv.org/abs/2606.12123</link>
  <pubDate>Thu, 11 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.12123v1 Announce Type: new Abstract: A privacy-compliant indoor localization approach utilizing a 3-D near-field (NF) passive radar imaging technique is presented. This technique leverages ubiquitously radiated electromagnetic fields for imaging, with passive tags introduced to enhance the strength of scattering fields, thereby enabling precise localization at the imaging level. The method also supports localization in non-ideal imaging scenarios, such as with limited bandwidth or in highly reflective environments. Based on their geometrical properties, the simple and low-cost passive tags enable intuitive differentiation between individuals or objects. Associated privacy protection mechanisms are discussed, where the frequency-varying properties of the passive tags provide additional flexibility and potential applications under privacy and ethical considerations. Several forms of passive tags are presented, where both simulation and experimental results validate the effectiveness of the proposed passive tag designs.</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>FlexiBrain: Resolution-Agnostic Voxel-Level Encoding for Native fMRI</title>
  <link>https://arxiv.org/abs/2606.11500</link>
  <pubDate>Thu, 11 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.11500v1 Announce Type: new Abstract: The success of large-scale deep learning models in neuroscience is fundamentally constrained by severe data heterogeneity. Native fMRI data aggregated from diverse sources exhibit substantial variation in both spatial and temporal resolutions. Consequently, most existing frameworks rely on lengthy, rigid preprocessing pipelines that enforce uniformity across datasets. This practice introduces two critical limitations: (1) potential degradation of subject-specific anatomical information; (2) significant computational overhead, often requiring hours of processing per subject. Here, we propose FlexiBrain, a resolution-agnostic voxel-level encoding framework for native fMRI based on Mamba-JEPA. FlexiBrain defines patch sizes in real-world physical units and employs dynamic patch resizing, thereby bypassing destructive spatial standardization while enabling direct ingestion of data in native space. We instantiate the framework using an efficient Mamba-JEPA backbone to model high-dimensional 4D fMRI signals. Across five diverse downstream neuroscience tasks, FlexiBrain consistently outperforms recent state-of-the-art methods, achieving gains of up to 12 percentage points without external data augmentation. Importantly, FlexiBrain functions as a seamless plug-in module, substantially reducing preprocessing costs and accelerating the development of robust voxel-level fMRI foundation models. Code is available at https://github.com/OneMore1/FlexiBrain.</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>Deep Slice Interpolation for Reducing Through-Plane Anisotropy and Noise in Head CT</title>
  <link>https://arxiv.org/abs/2606.09953</link>
  <pubDate>Wed, 10 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.09953v1 Announce Type: new Abstract: Head computed tomography (CT) typically uses sub-millimeter in-plane resolution but 2-5 mm through-plane spacing, creating substantial anisotropy that degrades multiplanar reconstructions, volumetric measurements such as hematoma volume estimation, and downstream algorithms that assume near-isotropic voxels. We present a deep learning system that synthesizes intermediate CT slices from pairs of neighboring axial slices, halving the effective through-plane spacing. The system improves three-dimensional visualization while simultaneously producing inherently denoised outputs, yielding two complementary benefits from a single inference pass. To build a reliable system, we systematically evaluate pixel-wise losses, namely mean squared error (MSE) and mean absolute error (L1); structural-similarity losses, namely the structural similarity index (SSIM) and its multi-scale variant (MS-SSIM); and hybrid combinations. On a held-out test set, all converged models outperform classical interpolation baselines and pretrained video frame interpolation methods (RIFE, FILM) on all structural measures, with MS-SSIM+L1 offering the strongest balanced profile. We also document training instability in SSIM-family losses and identify partial remedies: the standard numerical fixes eliminate the dominant failure mode but leave residual divergence at smaller batch sizes. All results are reported with patient-level bootstrap confidence intervals and paired statistical tests. As an illustration, we apply the system to an out-of-distribution head CT series from Hospital Universitario Virgen del Rocío: the model synthesizes intermediate slices and exhibits on the real slices the implicit-denoising signature predicted by our theoretical analysis, supporting in a single external case that interpolation quality and implicit denoising are not confined to the training distribution.</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>Laplace-Mixture Dipole Inversion for Quantitative Susceptibility Mapping</title>
  <link>https://arxiv.org/abs/2606.10240</link>
  <pubDate>Wed, 10 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.10240v1 Announce Type: new Abstract: Purpose: To develop an automatic dipole inversion method for quantitative susceptibility mapping (QSM) that preserves fine anatomical structures without the need for manual regularization-parameter tuning. Theory: The original approximate message passing with parameter estimation (AMP-PE) framework models image gradients with a single Laplace prior, which does not fully capture the heavy-tailed gradient distribution of brain susceptibility maps. This prior mismatch can lead to over-regularization and blocky reconstructions. We address this limitation by modeling the gradients with a two-component Laplace mixture prior. Methods: We propose a Laplace-Mixture Dipole Inversion (LAMDI) method by incorporating a two-component Laplace mixture prior into the AMP-PE framework with automatic parameter estimation. LAMDI was evaluated on a public in vivo dataset. Its performance was compared with FANSI, MEDI, and AMP-PE with a single-Laplace prior (AMP-PE-L1) under both standard default and reference-tuned settings. Results: On a public multi-orientation QSM dataset, LAMDI achieved NRMSE and SSIM comparable to AMP-PE-L1 while substantially reducing HFEN, suggesting improved preservation of high-frequency anatomical detail. Under reference-based tuning, FANSI and MEDI achieved the best performance for some metrics, but LAMDI remained competitive without requiring reference maps or manual regularization tuning. Conclusion: LAMDI provides an effective and automatic parameter-estimation alternative for QSM dipole inversion by combining competitive reconstruction accuracy with improved preservation of fine anatomical detail.</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>POPSICLE: Benchmark Datasets for Segmentation and Localization in CryoET</title>
  <link>https://arxiv.org/abs/2606.10255</link>
  <pubDate>Wed, 10 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.10255v1 Announce Type: new Abstract: Cryo-electron tomography (cryoET) has emerged as a powerful tool in structural and cellular biology by enabling direct visualization of macromolecular structures within intact cells, thereby linking molecular architecture to cellular organization in a native context. Realizing the full potential of cryoET, however, increasingly depends on advances in computational analysis, particularly machine learning (ML), to interpret its complex and information-rich data. Despite rapid progress, ML development for cryoET remains bottlenecked by the lack of standardized, well-annotated benchmarks. Existing evaluations are typically small, task-specific, and are assembled in isolation, limiting robust comparisons across methods. Here, we present POPSICLE, a benchmark suite for cryoET segmentation and macromolecular localization built from the CryoET Data Portal - an open, ML-ready repository of tomographic data, metadata, and annotations. POPSICLE spans eukaryotic and prokaryotic systems, both purified and fully in situ samples, and dense voxel-wise segmentation as well as sparse localization tasks. Built on a living data resource, it can expand as new datasets and annotations become available. Baseline experiments reveal substantial variation in model rankings across tasks, underscoring the need for benchmarks tailored to the unique characteristics of cryoET rather than evaluation practices adapted from adjacent biomedical imaging domains. POPSICLE thus provides an open and extensible foundation for reproducible ML evaluation in cryoET.</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>Overlapped Wavelet Diffusion for Low-Light Image Enhancement</title>
  <link>https://arxiv.org/abs/2606.10280</link>
  <pubDate>Wed, 10 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.10280v1 Announce Type: new Abstract: In this study, we propose an overlapped wavelet diffusion framework for Low-Light Image Enhancement (LLIE), which incorporates two complementary components to achieve blocking artifact-free and detail-preserving enhancement. Although recent diffusion-based LLIE methods have demonstrated remarkable performance compared with traditional approaches, DiffLL still suffers from blocking artifacts caused by the Haar Wavelet Transform (WT) and blurred edges or over-smoothed textures due to the limitations of its High-Frequency Restoration Module (HFRM). To overcome these issues, we introduce an Overlapped WT (OWT) that incorporates correlations across neighboring regions, thereby structurally preventing blocking artifacts. Furthermore, we integrate a low-frequency-guided High-Frequency Enhance Block (HFEBlock) to strengthen detail recovery, yielding sharper edges and more reliable textures. Extensive experiments on the LOLv1 and LOLv2-real datasets demonstrate that our framework, termed OWDiff, consistently outperforms existing LLIE methods both qualitatively and quantitatively, achieving superior visual quality while maintaining computational efficiency. OWDiff effectively addresses the structural limitations of the Haar WT and the HFRM, achieving an average PSNR gain of 0.58 dB, along with a 1.64% relative improvement in SSIM and a 5.9% relative reduction in LPIPS, compared to DiffLL across both the LOLv1 and LOLv2-real datasets.</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>Unsupervised Deep Learning for Limited-Angle STEM-EDX Tomography -- Application to 3D Chemical Analysis of Phase-Change Memory Devices</title>
  <link>https://arxiv.org/abs/2606.10547</link>
  <pubDate>Wed, 10 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.10547v1 Announce Type: new Abstract: Energy Dispersive X-ray (EDX) tomography in Scanning Transmission Electron Microscopy (STEM) enables 3D compositional and elemental mapping at the nanoscale, but its use is limited by restricted tilt ranges and low-dose conditions required to avoid beam damage. Limited-angle acquisition introduces missing-wedge artefacts such as elongation and anisotropic resolution, while noisy low-dose data further degrade reconstruction quality and quantitative reliability. Here, we introduce an unsupervised deep learning framework based on Deep Image Prior with total variation regularization (DIP-TV) for limited-angle STEM-EDX tomography. We extend it to a multi-channel formulation (DIPm-TV) that jointly reconstructs multiple elemental maps by exploiting spatial correlations. Using a synthetic 3-channel phantom, we show that the method compensates for severe missing-wedge artefacts corresponding to approximately $100^\circ$ of missing angular range under moderate noise, outperforming simultaneous iterative reconstruction technique and compressed sensing approaches. We apply the method to 3D chemical analysis of Ge-Sb-Te (GST) memory devices in virgin (as-fabricated) and SET (crystalline) operational states. Samples were prepared as cross-sectional focused ion beam lamellae and acquired under a limited-angle tilt range from $-40^\circ$ to $+40^\circ$ with $5^\circ$ steps and a dose of $2.0\times10^5$ $e^-/Ang^2$. The multi-channel approach enables voxel-by-voxel elemental reconstruction using only EDX signals without external structural priors such as high-angle annular dark-field imaging. The reconstructed volumes show near-isotropic spatial resolution and reveal compositional heterogeneities associated with device operation. This approach enables 3D chemical characterization in experimentally accessible sample geometries where conventional methods fail due to severe angular limitations.</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>++nnU-Net: Scaling nnU-Net with Prefix-Based Data Augmentation</title>
  <link>https://arxiv.org/abs/2606.10713</link>
  <pubDate>Wed, 10 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.10713v1 Announce Type: new Abstract: The nnU-Net has demonstrated continuous success in medical segmentation tasks, which heavily rely on the availability and diversity of annotated biomedical data. However, assembling medical imaging cohorts remains challenging due to numerous factors such as privacy regulations and annotation costs. As a result, data augmentation plays a crucial role in increasing data availability while maintaining anatomical feasibility. Hence, we propose the ++nnU-Net, a novel data augmentation module based on image registration that operates before preprocessing and training take place. Our framework was evaluated across five different 2D datasets. In this workflow, image data go through a two-stage registration process, generating new warped images. The transformations are then applied to the respective segmentation. In addition, the pipeline computes available disk space, generates supplementary binary synthetic masks and generates checkpoints. We demonstrate that the ++nnU-Net outperforms the nnU-Net baseline, yielding improvements in Dice Similarity Coefficient scores. In the most prominent cases, we observe performance gains of approximately 22%. These findings highlight the effectiveness of registration-based data augmentation, particularly for 2D medical imaging datasets and suggest that the ++nnU-Net provides a practical and scalable approach for enhancing segmentation performance in data-limited settings. The source code for the ++nnU-Net is available at: https://github.com/sofia-adelie/plusplusnnunet.git</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>Low-Dose 3D Bonding Mapping Through &quot;Soft&quot; Core-Loss EELS Tomography and Unsupervised Deep Learning</title>
  <link>https://arxiv.org/abs/2606.10893</link>
  <pubDate>Wed, 10 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.10893v1 Announce Type: new Abstract: Resolving the 3D chemical configuration of beam-sensitive nanomaterials at high spatial resolution remains a persistent frontier in scanning transmission electron microscopy (STEM). The main limitation lies in the trade-off between high electron dose required for analytical signals and the large number of projections needed for tomographic reconstruction. Here, we achieve dose-efficient 3D bonding mapping of FeO/Fe$_3$O$_4$ core-shell nanocubes with high resolution via electron energy loss spectroscopy (EELS). Our approach relies on two developments. First, a standardless &quot;soft&quot; core-loss EELS methodology exploiting Fe-M$_{2,3}$ edges provides ${\sim}50\times$ higher dose efficiency than conventional Fe-L$_{2,3}$ edges, using the latter only as a source of FeO and Fe$_3$O$_4$ standards. Second, we introduce multi-channel deep image prior with total variation regularization (DIPm-TV), an unsupervised method for spectroscopic tomography that jointly reconstructs multiple channels by exploiting spatial correlations under sparse-view and low-dose conditions. Using simulated datasets, high-quality reconstructions are obtained from as few as nine projections over $-70^\circ$ to $+70^\circ$, without HAADF-STEM signal or symmetry constraints. Applied to FeO/Fe$_3$O$_4$ nanocubes, Fe-M$_{2,3}$ EELS maps show improved SNR and spatial resolution, revealing a thin outer FeO shell surrounding the magnetite shell. DIPm-TV yields ${\sim}1$ nm isotropic resolution oxidation-state volumes preserving cubic morphology, recovering the outer FeO shell, and revealing a small internal void, features not accessible with conventional reconstruction methods. This work establishes a pathway for low-dose 2D and 3D analytical mapping of beam-sensitive materials using shallow core-loss edges, enabling orders-of-magnitude dose reduction while maintaining spectral fidelity and reliable 3D information.</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>Multimodal Brain Tumour Classification Using Feature Fusion</title>
  <link>https://arxiv.org/abs/2606.11107</link>
  <pubDate>Wed, 10 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.11107v1 Announce Type: new Abstract: Clinicians diagnose brain tumors by synthesizing patient symptoms, medical history, and quantitative imaging data from modalities such as MRI and CT scans into a unified clinical judgement. However, most deep learning models rely on MRI/CT images alone, failing to replicate clinicians&#39; multimodal reasoning. We explore a two-branch multimodal network combining raw MRI scans with 91 extracted radiomic features (intensity, texture, shape, and boundary descriptors) to classify brain tumors into glioma, meningioma, pituitary, and no-tumor. A pre-trained CNN backbone encodes the image stream, whereas a dedicated MLP encodes the radiomic stream. Both streams are fused via concatenation, gated, or bidirectional cross-modal attention strategies. Across nine experimental runs on a balanced 7,200-image dataset, all multimodal configurations outperform unimodal baselines, with gated fusion achieving the best accuracy of 96.13%.</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>Safecloud: A Distributed, Encrypted Storage Cloud for Streaming</title>
  <link>https://arxiv.org/abs/2606.09870</link>
  <pubDate>Wed, 10 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.09870v1 Announce Type: cross Abstract: We present Safecloud, a distributed, encrypted, self-pricing storage and streaming network whose storage and routing nodes never see plaintext and never hold keys. Each file is split into chunks, encrypted on the owner&#39;s device, and distributed across Drops (browser tabs storing ciphertext in IndexedDB) and Jets (federated routing servers). Only the owner, or an authorised grantee, can decrypt. We make five contributions: (1) A one-root key hierarchy: every key derives deterministically from a single root via HKDF, and owner and range-scoped grantee derive identical chunk keys (derivation agreement); a subtree key derives its range and nothing else (delegation containment). (2) Convergent content addressing: identical content yields identical ciphertext and identifiers, enabling deduplication without plaintext exposure, with identifiers binding authenticated ciphertext so a keyless Drop verifies integrity (blind verifiability). (3) Three parallel trees over one navigation path (Merkle for integrity, key-derivation for confidentiality, access for authorisation), with sound Merkle-verified retrieval. (4) The key tree doubles as a streaming index: a player derives each segment key in O(1), seeking by derivation, while parallel tracks (video, audio, captions) are independent subtrees unlockable per-track and per-segment, a combination we believe no prior encrypted-storage network offers. (5) Jets and Drops earn Safebux verifiably, kept honest by a one-signature proof-of-storage challenge under chilling-effect Proof-of-Corruption, a zero-sum economy that is significantly cheaper than Filecoin&#39;s proof-of-replication sealing (which is slow and provides no confidentiality). We give the architecture, cryptographic construction, a threat model, and an open-source reference implementation, stating precisely what is implemented versus designed.</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>Unleashing Correlation and Continuity for Hyperspectral Reconstruction from RGB Images</title>
  <link>https://arxiv.org/abs/2501.01481</link>
  <pubDate>Wed, 10 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2501.01481v2 Announce Type: replace Abstract: Reconstructing Hyperspectral Images (HSI) from RGB images can yield high spatial resolution HSI at a lower cost, demonstrating significant application potential. This paper reveals that local correlation and global continuity of the spectral characteristics are crucial for HSI reconstruction tasks. Therefore, we fully explore these inter-spectral relationships and propose a Correlation and Continuity Network (CCNet) for HSI reconstruction from RGB images. For the correlation of local spectrum, we introduce the Group-wise Spectral Correlation Modeling (GrSCM) module, which efficiently establishes spectral band similarity within a localized range. For the continuity of global spectrum, we design the Neighborhood-wise Spectral Continuity Modeling (NeSCM) module, which employs memory units to recursively model the progressive variation characteristics at the global level. In order to explore the inherent complementarity of these two modules, we design the Patch-wise Adaptive Fusion (PAF) module to efficiently integrate global continuity features into the spectral features in a patch-wise adaptive manner. These innovations enhance the quality of reconstructed HSI. We perform comprehensive comparison and ablation experiments on the mainstream datasets NTIRE2022 and NTIRE2020 for the spectral reconstruction task. Compared to the current advanced spectral reconstruction algorithms, our designed algorithm achieves State-Of-The-Art (SOTA) performance.</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>Cyst-X: A Multi-Center MRI Benchmark and Federated Learning Framework for Malignancy-Risk Stratification of Pancreatic Cystic Neoplasm</title>
  <link>https://arxiv.org/abs/2507.22017</link>
  <pubDate>Wed, 10 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2507.22017v4 Announce Type: replace Abstract: Pancreatic cancer is projected to be the second-deadliest cancer by 2030, making early detection critical. Intraductal papillary mucinous neoplasms (IPMNs), key cancer precursors, present a clinical dilemma, as current guidelines struggle to stratify malignancy risk, leading to unnecessary surgeries or missed diagnoses. Here, we introduce Cyst-X, a multi-center MRI benchmark and a federated learning framework for IPMN malignancy-risk stratification. The dataset comprises 1,461 abdominal MRI scans from 764 patients at seven international centers, with three-tier malignancy labels anchored in histopathology or three-year imaging follow-up and expert pancreas segmentations. The pipeline couples the PanSegNet pancreas segmenter with a 3D DenseNet-121 classifier and a parallel radiomics predictor. On internal cross-validation, the deep learning classifier reached a mean area under the receiver operating characteristic curve (AUC) of 0.85 (95% confidence interval 0.84-0.86) on T2-weighted MRI for high-risk versus low- or no-risk discrimination, with the average precision rising from a prevalence baseline of 0.23 to 0.64. This performance was preserved (AUC 0.85, FedProx) when training was distributed across institutions without exchange of raw patient images. Benchmarked against three blinded radiologists on a 629-case reader subset evaluated under imaging-only conditions, the classifier matched or exceeded sensitivity at comparable specificity. To accelerate research in early pancreatic cancer detection, we publicly release the Cyst-X dataset, segmentation masks, and trained models as the first large-scale, multi-centre MRI resource for pancreatic cystic neoplasm analysis.</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>Selective Disk Bispectrum: A Complete and Rotation Invariant Image Descriptor</title>
  <link>https://arxiv.org/abs/2511.19706</link>
  <pubDate>Wed, 10 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2511.19706v2 Announce Type: replace Abstract: Rotation invariance is a fundamental requirement across many computer vision tasks. Historically, this inductive bias has been encoded through hand-crafted rotation-invariant representations. These are compact, interpretable, and fast to compute, but they come at the cost of descriptive power. More recently, architectures achieve inductive bias through learned representations. These are highly descriptive and achieve strong empirical performance, at the cost of efficiency and interpretability. In this work, we propose an alternative at the intersection of both paradigms. We introduce the selective disk bispectrum (SDB), a complex-valued rotation-invariant vector that preserves all information about the image except its orientation. Our key theoretical contributions are the selective disk bispectrum, its inversion, its (reduced) spatial and computational complexities (compared to the full disk bispectrum), and its expectation and variance under noise. Furthermore, we propose a numerical SDB approximation and provide theoretical guarantees for its accuracy and rotation invariance. Empirically, we validate SDB&#39;s invariance and robustness to noise on classification tasks. We test our reconstruction algorithm on multi-reference alignment of rotated images.</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>Real-time 3D Visualization of Radiance Fields on Light Field Displays</title>
  <link>https://arxiv.org/abs/2508.18540</link>
  <pubDate>Wed, 10 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2508.18540v2 Announce Type: replace-cross Abstract: Radiance fields, including their recent efficient forms such as 3D Gaussian Splatting and Sparse Voxels, have revolutionized photorealistic 3D scene visualization by enabling high-fidelity reconstruction of complex environments, making them a natural match for light field displays. However, integrating these technologies presents significant computational challenges, as light field displays require many high-resolution renderings from slightly shifted viewpoints, while radiance fields rely on computationally intensive volume rendering, making real-time speeds intractable even with efficient scene representations. In this paper, we propose a unified and efficient framework for real-time radiance field rendering on light field displays. Rather than re-rendering each view independently, our method converts the input radiance field into shared intermediate sweeping planes that can be efficiently composited into dense light-field views in a single pass. Our method prioritizes shared, non-directional plane caching for real-time performance, trading fine view-dependent color effects for a modest increase in intermediate memory usage. Our framework generalizes across different scene representations without retraining and avoids repeated computation across views. We further demonstrate a real-time interactive application on a Looking Glass display, achieving 200+ FPS at 512p across 45 rendered views and enabling seamless, immersive 3D interactive viewing experiences. On standard benchmarks, our method achieves up to 22x speedup compared to independently rendering each view, while largely preserving image quality.</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>Curved Beam Enabled Wireless Communications: Modeling, Analysis and Optimization</title>
  <link>https://arxiv.org/abs/2606.10164</link>
  <pubDate>Wed, 10 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.10164v1 Announce Type: new Abstract: In this paper, the problem of using curved beams to improve wireless communication performance in the presence of a blockage is studied. In particular, a transmitter equipped with a continuous aperture array can generate curved beams to serve multiple receivers by allowing signals to propagate along both straight and curved paths. To optimize the weighted sum-rate, a curved beam model is developed for controlling the beam steering, beam focusing, and beam curving functions, along with a segmented channel model to characterize practical channels induced by the blockage. Based on the introduced curved beam model, an optimization problem is posed with the goal of maximizing the weighted sum-rate of all users under a transmit power budget and physical constraints of curved beams. To solve this problem, the continuous aperture is first converted into finite summations via a discrete sampling of the continuous coordinate. Then, the performance gap between the ideal continuous aperture design and its practical discrete aperture approximation is analyzed. Based on the above discrete approximation, an iterative algorithm is developed to optimize curved beam control parameters. In particular, the original problem is reformulated into a tractable form via fractional programming (FP). The transformed problem is then solved by designing an enhanced block coordinate ascent (BCA) method which determines a surrogate-construction point leveraging the local descent from previous iterations, thereby accelerating convergence. Next, a proximal regularization term is added to the surrogate function to control the update magnitude and suppress aggressive updates, thereby improving update stability. Finally, the beam amplitudes are computed based on the effective channel gains. Simulation results show that the proposed method can improve the weighted sum-rate compared to using only straight beams.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Optimal Illumination via Joint Movement and Phase Optimization for Movable Antenna-RIS Configuration</title>
  <link>https://arxiv.org/abs/2606.10190</link>
  <pubDate>Wed, 10 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.10190v1 Announce Type: new Abstract: Reconfigurable intelligent surfaces (RIS) enable programmable control of wireless propagation but remain vulnerable to persistent deep fades in static deployments. This paper introduces a Movable Antenna-enhanced RIS (MA-RIS) architecture where antenna elements physically reposition to sample independent spatial channels, enabling mobility-induced diversity. We model antenna motion using a Stochastic Differential Equation (SDE) framework capturing controlled drift and environmental diffusion. Itô calculus-based analysis characterizes steady-state antenna distributions, spatial decorrelation, and outage probability, revealing fundamental trade-offs between control strength and mobility randomness. To maximize long-term SNR while accounting for control overhead, we propose an overhead-aware Two-timescale framework separating slow antenna trajectory control from fast phase adaptation. The stochastic optimal control problem is solved via predictive approximation of the Hamilton-Jacobi-Bellman (HJB) formulation, enabling real-time implementation. Simulations validate theoretical predictions: the Two-timescale strategy achieves up to 36 dB steady-state SNR with remarkable stability, outperforming position-only control by up to 15 dB and uncontrolled baselines by over 30 dB. Despite experiencing a lower SNR than Active RIS, the proposed approach delivers up to 16 times higher energy efficiency (EE) across varying system scales, establishing a new paradigm of mobility-enabled channel adaptation for resilient wireless systems.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Simplified Temporal Convolutional-Based Channel Estimation for a WiFi Vehicular Communication Channel</title>
  <link>https://arxiv.org/abs/2606.10511</link>
  <pubDate>Wed, 10 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.10511v1 Announce Type: new Abstract: Channel estimation in vehicular communication is a crucial element in the advancement of intelligent transportation systems. However, the use of pilot signals in the IEEE 802.11p standard is insufficient for accurate channel estimation in high-mobility scenarios. Data pilot-aided (DPA) estimation helps address this, but suffers from demapping errors. We propose a simplified Temporal Convolutional Network-based estimator (DPA-TCN) trained on a mixed signal-to-noise ratio dataset to improve estimation performance and reduce computational complexity. Our DPA-TCN estimator achieves a bit error rate comparable to a state-of-the-art long short-term memory network with DPA and temporal averaging (LSTM-DPA-TA) while reducing the complexity of the model by approximately 65%.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Complex VAE with Heavy-Tailed Likelihood for Radar Target Detection in Sea Clutter</title>
  <link>https://arxiv.org/abs/2606.10540</link>
  <pubDate>Wed, 10 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.10540v1 Announce Type: new Abstract: To address the heavy-tailed, spike-prone nature of sea clutter and the scarcity of labeled target data, an unsupervised complex-valued variational autoencoder (VAE) for maritime radar target detection is proposed. In implementation, each complex baseband slow-time sequence is represented by its in-phase and quadrature components, and the model learns their joint reconstruction from clutter-only data. A Student-\(t\) negative log-likelihood is adopted to capture heavy-tailed reconstruction errors while reducing sensitivity to outliers during clutter learning. In addition, a time-domain amplitude error constraint is introduced to penalize slow-time magnitude mismatch in the reconstruction. At inference, reconstruction deviation is used as the detection statistic, and the decision threshold is set via an empirical quantile estimated from a clutter-only validation set to enforce a constant false-alarm rate (CFAR). Experiments on measured sea-clutter data show that detection performance is consistently improved over MF, AMF, and a real-valued \(\beta\)-VAE under CFAR constraints.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Information Bottleneck Meets Quantization: Finite Rate Analysis and Optimal Designs</title>
  <link>https://arxiv.org/abs/2606.10869</link>
  <pubDate>Wed, 10 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.10869v1 Announce Type: new Abstract: The Information Bottleneck (IB) is a well-established framework that looks for a latent compact representation of a data source by trading rate and data-size representation for information accuracy with respect to another target data. The Gaussian IB (GIB) is its simple closed-form solution when the target is jointly Gaussian with the source. In many practical problems, however, the latent representation has to be stored or represented with a finite number of bits, whereas the optimal (G)IB solution has no such finite-bit representation. First, this manuscript theoretically analyzes the effect of scalar and vector quantization of the GIB latent representation, and its impact on the (dis)informativeness with respect to the target data. Then, task-oriented quantization designs are proposed by (jointly) reformulating the GIB optimization problem under a finite-rate constraint on the latent representation. Simulation results on MMSE regression problems confirm the effectiveness of the proposed quantization designs, which show significant gains with respect to more heuristic, or separate, quantization designs of the standard GIB latent representation. Finally, the paper extends the task-oriented philosophy to non-Gaussian settings, by properly modifying the cost function used in variational auto-encoders (VAEs) of IB-inspired vector quantizers.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Personalized Deep Learning for Short-Term Forecasting of Impending Atrial Fibrillation from Continuous Wearable ECG Signals</title>
  <link>https://arxiv.org/abs/2606.10900</link>
  <pubDate>Wed, 10 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.10900v1 Announce Type: new Abstract: Background and Objective: Continuous wearable electrocardiogram (ECG) monitoring is increasingly used for ambulatory arrhythmia surveillance, yet forecasting impending atrial fibrillation (AF) is challenged by inter-patient ECG variability. This study investigated whether personalizing a global model via fine-tuning on an individual&#39;s ECG signals improves short-term forecasting of impending AF. Methods: A global model trained on the ICENTIA11K dataset was compared against personalized models fine-tuned across three cohorts: ICENTIA11K, IRIDIA-AF, and MobiCARE. Following preprocessing, models processed 60-second ECG segments for a five-minute forecast horizon. We evaluated the impact of adaptation data volume and analyzed ECG features, such as heart rate and RMSSD. Results: Personalized models significantly outperformed the global model, achieving AUROCs of 0.711 vs. 0.614 in ICENTIA11K and 0.686 vs. 0.585 in MobiCARE. Personalization benefits increased with the amount of patient-specific fine-tuning data. While the global model&#39;s accuracy rose as AF onset approached, personalized models in the two external cohorts exhibited distinct temporal dynamics, which may indicate the capture of patient-specific characteristics less dependent on proximity to the AF event. Pre-AF episodes showed elevated heart rates and RMSSD. Feature attributions highlighted clinically relevant precursors, including frequent premature atrial complexes (PACs) and short supraventricular tachycardias (SVTs). Conclusions: Adapting deep learning models with patient-specific wearable ECG data significantly enhances short-term forecasting of impending AF. This personalized framework supports timely preventive interventions and improved AF management in ambulatory monitoring environments.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Multi-Channel Soil Moisture Measurement: High Accuracy and Low Crosstalk Through Optical-Semiconductor Based Differential Sensing</title>
  <link>https://arxiv.org/abs/2606.11020</link>
  <pubDate>Wed, 10 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.11020v1 Announce Type: new Abstract: Soil moisture measurement plays a key role in irrigation and environmental management. Yet it remains unreliable due to heterogeneous soils, limited sensing volumes, temperature drift, and parasitic inter-channel coupling. This work presents a compact multi-depth capacitive probe that extends a parallel-plate geometry from previous work with differential activation to suppress stray capacitances and improve accuracy. An equivalent-circuit model quantifies parasitic effects, and optically coupled transistor bridges isolate each sensing layer. Raw capacitance is converted to volumetric water content and plant-available water using established calibration models. Laboratory results show a fourfold reduction in temperature sensitivity, strong confinement of the sensing volume, and improved repeatability in heterogeneous soils. Field validation against reference sensors demonstrates high accuracy and precision comparable to widely used instruments, enabling a practical and scalable solution for agricultural and urban soil-moisture monitoring.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>DMT: Demographic Conditioning, Morphology-Enhanced Transformer for Cuffless Blood Pressure Estimation from PPG Signals</title>
  <link>https://arxiv.org/abs/2606.11125</link>
  <pubDate>Wed, 10 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.11125v1 Announce Type: new Abstract: Blood pressure (BP) is a key marker for cardiovascular risk assessment and therapeutic decision-making, and photoplethysmography (PPG) enables low-cost, wearable-friendly cuffless BP estimation. However, even with recent progress, many PPG-based models are trained with BP regression alone and may rely on amplitude-dominated shortcuts. In addition, demographic covariates that systematically modulate vascular compliance are often incorporated only via late fusion, limiting subject-specific representation learning. We propose a Transformer-based network for cuffless BP estimation from PPG signals, leveraging self-attention to capture long-range dependencies across multiple cardiac cycles. To account for subject-specific vascular differences, the model is conditioned on demographics via FiLM-style feature modulation applied through the attention and feed-forward sublayers of Transformer blocks. In addition, we add an auxiliary morphology head to guide the model to attend to BP-relevant waveform morphology associated with arterial stiffness and wave reflection. Under calibration-based evaluation protocols on the large-scale PulseDB dataset, the proposed method achieves an MAE of 4.56 mmHg for systolic BP and 2.62 mmHg for diastolic BP, reducing errors by 47% and 50% compared with prior demographic-enhanced PPG baselines. The resulting lightweight, single-sensor model supports scalable and clinically grounded cuffless BP estimation in calibration-enabled deployment settings.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Pre-Fault Voltage Discrimination and Time-Domain Protection for Distribution Networks with Inverter-Based Resources</title>
  <link>https://arxiv.org/abs/2606.11135</link>
  <pubDate>Wed, 10 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.11135v1 Announce Type: new Abstract: The increasing proliferation of inverter-based resources (IBRs) in distribution networks is presenting a major challenge for phasor-based overcurrent protection. This challenge stems from IBRs&#39; lack of short-circuit current sourcing capacity. As a result, traditional overcurrent protection functions (e.g., ANSI 51) are inadequate in such scenarios, and warrant alternative approaches. Time-domain protection, for example, shows promise in overcoming this challenge. In this paper we propose a pre-fault voltage discrimination (PVD) strategy whose role is to detect faults and discriminate normal switching and transformer inrush disturbances from actual faults. The use of PVD allows for the design of a simple, yet effective fault detection algorithm by using time-domain protection principles for distribution networks containing IBRs. The introduction of PVD provides for faster fault detection without reducing security and dependability. Offline simulation experiments and controller hardware-in-the-loop real-time simulation validate the effectiveness of the proposed algorithm against various fault and normal switching events.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Structured Adaptive Tensor Prediction for Streaming Data</title>
  <link>https://arxiv.org/abs/2606.10085</link>
  <pubDate>Wed, 10 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.10085v1 Announce Type: cross Abstract: Matrix-valued time series arise in a wide range of applications, such as spatio-temporal data from medical imaging and geophysics. Existing methods are mainly designed for static settings and lack adaptability to streaming and time-varying environments. Adaptive filtering techniques have also been largely limited to data with scalar or vector values, leaving adaptive forecasting for matrix-valued time series inadequately understood. To bridge these gaps, we develop an adaptive tensor regression framework that includes Matrix-on-Matrix (MoM) and Tensor-on-Matrix (ToM) formulations for streaming matrix-valued prediction. The two formulations differ in whether to directly model matrix-valued outputs or to exploit temporal structure via higher-order tensor representations. For the proposed tensor regression framework, we develop stochastic gradient descent (SGD) algorithms for online learning. We show that stacking multiple responses across time into higher-order tensors improves performance; in particular, the ToM achieves lower steady-state error and stronger denoising capability than MoM, motivating our focus on the ToM model. We further characterize the tracking behavior of SGD under time-varying dynamics. From a statistical perspective, we establish fixed-time recovery guarantees for ToM under general low-dimensional structures, including sparsity, low-rankness, and their joint sparse and low-rank models.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>A Comprehensive Inference-Time Augmentation Framework in Physiological Signals: Application to PPG-Based AF Detection</title>
  <link>https://arxiv.org/abs/2606.10410</link>
  <pubDate>Wed, 10 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.10410v1 Announce Type: cross Abstract: Objective: Accurate classification of physiological signals in real-world deployments is challenged by sensor noise, motion artifacts, and distribution shifts between training and deployment data. Inference-time augmentation (ITA), which applies augmentations during inference rather than retraining, offers a simple, model-agnostic mechanism to improve robustness. However, ITA application to physiological signals has remained narrow in scope, relying on limited augmentation methods with fixed, unoptimized parameters. This work proposes a unified ITA framework to address that gap. Approach: The framework incorporates 13 augmentation methods spanning time-domain, amplitude-domain, frequency-domain, and artifact-injection transformations, with hyperparameters optimized via Bayesian optimization. We evaluate on atrial fibrillation (AF) detection from 30-second PPG signals using GPT-PPG and ResNet across five datasets comprising more than 400 patients and approximately 9,800 hours of recording. Main results: Standard ITA consistently improved AUROC (up to 8.5% for GPT-PPG and 0.7% for ResNet) and AUPRC (up to 10.6% for GPT-PPG and 0.8% for ResNet). Selective ITA further reduced average FPR by up to 4.4% (GPT-PPG) and 1.3% (ResNet) on non-AF datasets. Significance: These findings establish ITA as a practical, model-agnostic approach for improving PPG-based AF classification reliability in deployment settings where retraining is not feasible, with broader applicability to physiological signal analysis.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Learning Doubly Sparse Explicitly Conditioned Transforms</title>
  <link>https://arxiv.org/abs/2606.10975</link>
  <pubDate>Wed, 10 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.10975v1 Announce Type: cross Abstract: Finding convenient spaces in which certain hypotheses regarding an assumed sparse structure of natural signals hold true has become a desirable result in recent research, its implications being reflected in areas such as data compression, noise reduction and feature extraction. While the extensively used analytical transforms, such as DFT or DCT, already provide efficient algorithms and robust sparse representations, they assume a fixed prior about the data, failing to accurately capture the specific structure of more restrictive classes of signals. To address this, the concept of a data-adaptive, learnt transform has been introduced in the literature, allowing for the reduction of a residual term in the transform domain. More recent studies have shown that the condition number serves as a good metric in this context, where the desired outcome alternates between a generalizing tendency and one that achieves minimal approximation error. Motivated by these considerations, we introduce the learning of a structured, explicitly conditioned transform formulated as the product of a fixed canonical matrix and a refining data-adaptive sparse component. This approach seeks to preserve the advantages of fast and stable analytical transforms, while introducing controllable adaptivity to the data. No references that concern this specific formulation have been identified so far, indicating its novelty. The proposed algorithm is motivated within the framework of inexact proximal methods, leveraging a newly derived closed-form projection operator. Empirical observations demonstrate state-of-the-art results on the doubly sparse transform learning problem and comparable performance with its dense variant at significantly lower computational costs and sometimes faster convergence and better avoidance of bad local minima.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Federated Learning Enhanced by Feature Reconstruction for Semantic Communication Module Updates of Agents</title>
  <link>https://arxiv.org/abs/2508.03248</link>
  <pubDate>Wed, 10 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2508.03248v3 Announce Type: replace Abstract: Recent advancements in semantic communication have primarily focused on image transmission, where neural network-based joint source-channel coding modules play a central role. However, such systems often experience semantic communication errors due to mismatched knowledge bases between agents and performance degradation from outdated models, necessitating regular model updates. To address these challenges in vector quantization (VQ)-based image semantic communication systems, we propose FedSFR, a novel federated learning framework that incorporates semantic feature reconstruction (FR). FedSFR introduces an FR step at the parameter server and allows a subset of clients to transmit compact feature vectors in lieu of sending full local model updates, thereby improving training stability and communication efficiency. To enable effective FR learning, we design a loss function tailored for VQ-based image semantic communication and demonstrate its validity as a surrogate for image reconstruction error. We further establish a rigorous convergence analysis of FedSFR. Experimental results on two benchmark datasets validate the superiority of FedSFR over existing baselines, especially in capacity-constrained settings, confirming both its effectiveness and robustness.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Delay-Doppler Domain Channel Measurements and Modeling in High-Speed Railways</title>
  <link>https://arxiv.org/abs/2509.25854</link>
  <pubDate>Wed, 10 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2509.25854v2 Announce Type: replace Abstract: As next-generation wireless communication systems need to be able to operate in high-frequency bands and high-mobility scenarios, delay-Doppler (DD) domain multicarrier (DDMC) modulation schemes, such as orthogonal time frequency space (OTFS), demonstrate superior reliability over orthogonal frequency division multiplexing (OFDM). Accurate DD domain channel modeling is essential for DDMC system design. However, since traditional channel modeling approaches are mainly confined to time, frequency, and space domains, the principles of DD domain channel modeling remain poorly studied. To address this issue, we propose a systematic DD domain channel measurement and modeling methodology in high-speed railway (HSR) scenarios. First, we design a DD domain channel measurement method based on the long-term evolution for railway (LTE-R) system. Second, for DD domain channel modeling, we investigate the quasi-stationary interval, statistical power modeling of multipath components, and, in particular, the quasi-invariant intervals of DD domain channel fading coefficients. Third, via LTE-R measurements at 371 km/h, taking the quasi-stationary interval as the decision criterion, we establish DD domain channel models under different channel time-varying conditions in HSR scenarios. Fourth, the accuracy of the proposed DD domain channel models is validated via a bit error rate comparison of OTFS transmission. In addition, simulations verify that in HSR scenarios, the quasi-invariant interval of the DD domain channel fading coefficients is on the order of milliseconds (ms), which is much smaller than the quasi-stationary interval length, on the order of 100 ms. This study could provide theoretical guidance for DD domain modeling in high-mobility environments, supporting future DDMC and integrated sensing and communication designs for 6G and beyond.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Gridless Full-Space DOA Estimation for STAR-RIS-Assisted Wireless Systems</title>
  <link>https://arxiv.org/abs/2602.02893</link>
  <pubDate>Wed, 10 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2602.02893v2 Announce Type: replace Abstract: Simultaneously transmitting and reflecting reconfigurable intelligent surfaces (STAR-RIS) enable full-space ($0^\circ$--$360^\circ$) signal coverage, making them a compelling platform for integrated sensing and communication in next-generation wireless networks. In this paper, we investigate gridless direction-of-arrival (DOA) estimation across the full spatial domain in STAR-RIS-assisted systems operating with a single RF sensing chain. We show that the coupled reflection-transmission mechanism of STAR-RIS induces a multichannel finite-rate-of-innovation (FRI) structure in the received signal, which enables casting DOA estimation as a structured low-rank recovery problem without angular grid discretization. Building on this observation, we develop a proximal gradient descent algorithm with alternating projections onto a block-Hankel matrix set, enabling robust angle retrieval from limited measurements. Two practically relevant STAR-RIS configurations are addressed: element-wise uniform and nonuniform energy-splitting designs, each handled through a dedicated lifting strategy that preserves the underlying algebraic structure. A Ziv-Zakai bound is derived for the coupled full-space sensing model as a performance benchmark across the full SNR range. Numerical results show that the proposed methods consistently outperform grid-based baselines, achieving sub-degree accuracy within $\pm 60^\circ$ of boresight at comparable or lower computational cost.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Towards 6G Single-Anchor Vehicle Localization Exploiting Radio-Reflective Road Markings in Tunnel Environments</title>
  <link>https://arxiv.org/abs/2604.04217</link>
  <pubDate>Wed, 10 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2604.04217v2 Announce Type: replace Abstract: Accurate vehicular localization remains a key challenge for cooperative intelligent transport systems (C-ITS), especially in areas without global navigation satellite system (GNSS) coverage, such as road tunnels. This paper proposes a novel vehicle positioning method with a single anchor equipped with multiple antennas, exploiting near-field (NF) propagation and passive radio-reflective structures deployed along the GNSS-denied tunnel. The method assumes wideband vehicle-to-everything (V2X) communication between the vehicle and the anchor, in line with the ongoing standardization of cellular V2X beyond 5G. We first derive the validity condition that allows us to approximate the multipath channel with a single reflector point, defining a geometry validity bound on the number of antennas that can be employed. Building on this result, we propose JAVELIN, a 6G-compatible single-anchor localization framework that leverages tensor-based NF parameter estimation, adaptive NF/far-field (FF) processing, and recursive Bayesian tracking to enable sub-meter positioning without multi-anchor synchronization. The method integrates angle, delay difference, and curvature measurements into a variable-dimension extended Kalman filter with gated nearest-neighbor association, enabling operation without prior environmental knowledge. Radio-reflective road markings are further introduced to enhance geometric diversity. Simulation results in realistic tunnel scenarios demonstrate accurate and robust localization under different conditions, outperforming state-of-the-art single-anchor approaches and benefiting from passive reflector deployment.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Stability Analysis for Autoregressive Sampling Sets</title>
  <link>https://arxiv.org/abs/2606.03942</link>
  <pubDate>Wed, 10 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.03942v2 Announce Type: replace Abstract: Motivated by recent developments in stochastic modeling of clock jitter in Analog-to-Digital Converters (ADCs) as autoregressive processes of order one (AR(1)), we study the density and stability properties of AR(1)-jittered sampling sets for Paley-Wiener signals. We show that, despite having the correct asymptotic density both on average and almost surely, such sets almost surely fail to be stable sampling sets. We complement this negative result with a finite-dimensional analysis, showing that the corresponding jittered sinc matrices are nonetheless well-conditioned with high probability.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Symmetry-Aware Convex Shrinkage for High-Dimensional Covariance Estimation</title>
  <link>https://arxiv.org/abs/2605.17111</link>
  <pubDate>Wed, 10 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2605.17111v2 Announce Type: replace-cross Abstract: We develop a class of data-adaptive shrinkage estimators for high-dimensional covariance estimation in which the shrinkage target is a Reynolds projection of the sample covariance under a finite symmetry group selected from a candidate library by held-out predictive performance. The class generalizes the convex shrinkage estimator of Ledoit and Wolf by replacing the scalar-identity target with a structured target derived from a symmetry group when one is available, and generalizes the group-symmetric maximum-likelihood estimator of Shah and Chandrasekaran by combining structural targeting with adaptive convex shrinkage and by selecting the group from data rather than treating it as prespecified. A two-tier procedure performs the group selection: a universal per-candidate evaluation based on held-out negative log-likelihood, optionally preceded by a domain-specific step that constructs the candidate library from structural priors. We establish a finite-sample regret bound for the held-out calibration of the convex combination weight, an oracle inequality for the data-driven group selection, and a quantitative sufficient-match condition under which the proposed estimator dominates Ledoit-Wolf shrinkage in Frobenius mean-squared error. The procedure is illustrated on six real-data problems spanning finance (S&amp;P 500 daily returns), climate (NOAA OISST sea-surface temperature anomalies), genomics (TCGA-BRCA gene expression), radio signal processing (RadioML 2018.A), astronomical imaging (Galaxy10 DECaLS), and natural image patches (CIFAR-10 with a CIFAR-10.1 distribution-shift companion). An empirical comparison is also made against the Bayesian permutation-symmetry estimator of Chojecki and colleagues. Outside the few-shot regime, where structural priors carry the most information per observation, Ledoit-Wolf shrinkage remains the appropriate baseline.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>A Lumped RC Equivalent Circuit of Head Tissues for Dispersive Neuro-Electromagnetic Modeling</title>
  <link>https://arxiv.org/abs/2605.29996</link>
  <pubDate>Wed, 10 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2605.29996v2 Announce Type: replace-cross Abstract: Accurate modeling of electric potential and current distribution in head tissues is crucial for the design and evaluation of neuro-sensing and neuro-stimulation systems operating in the sub-megahertz frequency range. Numerical methods are widely employed in electromagnetic simulations; however, their computational cost can limit their applicability to rapid prototyping, real-time simulations, and circuit-level integration. In this work, we introduce a lumped RC equivalent circuit model that reproduces the electrical behavior of a canonical three-layer spherical head geometry over a frequency range up to 50 kHz. The model accounts for frequency-dependent tissue conductivity and permittivity to capture dispersive effects, employing complex conductivity in the electro-quasi-static (EQS) regime. The circuit topology uses a minimal set of impedance elements to represent the essential mechanisms of electric signal propagation. Validation was performed using a dipolar brain source configuration for scalp voltage peak estimation, showing close agreement with semi-analytical solutions across different skull thicknesses and dipole eccentricities. In addition, the impact of tissue dispersion and capacitive branches on the model predictions was quantitatively assessed, showing their contribution to the overall fidelity of the proposed approach.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Deep Slice Interpolation for Reducing Through-Plane Anisotropy and Noise in Head CT</title>
  <link>https://arxiv.org/abs/2606.09953</link>
  <pubDate>Wed, 10 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.09953v1 Announce Type: new Abstract: Head computed tomography (CT) typically uses sub-millimeter in-plane resolution but 2-5 mm through-plane spacing, creating substantial anisotropy that degrades multiplanar reconstructions, volumetric measurements such as hematoma volume estimation, and downstream algorithms that assume near-isotropic voxels. We present a deep learning system that synthesizes intermediate CT slices from pairs of neighboring axial slices, halving the effective through-plane spacing. The system improves three-dimensional visualization while simultaneously producing inherently denoised outputs, yielding two complementary benefits from a single inference pass. To build a reliable system, we systematically evaluate pixel-wise losses, namely mean squared error (MSE) and mean absolute error (L1); structural-similarity losses, namely the structural similarity index (SSIM) and its multi-scale variant (MS-SSIM); and hybrid combinations. On a held-out test set, all converged models outperform classical interpolation baselines and pretrained video frame interpolation methods (RIFE, FILM) on all structural measures, with MS-SSIM+L1 offering the strongest balanced profile. We also document training instability in SSIM-family losses and identify partial remedies: the standard numerical fixes eliminate the dominant failure mode but leave residual divergence at smaller batch sizes. All results are reported with patient-level bootstrap confidence intervals and paired statistical tests. As an illustration, we apply the system to an out-of-distribution head CT series from Hospital Universitario Virgen del Rocío: the model synthesizes intermediate slices and exhibits on the real slices the implicit-denoising signature predicted by our theoretical analysis, supporting in a single external case that interpolation quality and implicit denoising are not confined to the training distribution.</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>Laplace-Mixture Dipole Inversion for Quantitative Susceptibility Mapping</title>
  <link>https://arxiv.org/abs/2606.10240</link>
  <pubDate>Wed, 10 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.10240v1 Announce Type: new Abstract: Purpose: To develop an automatic dipole inversion method for quantitative susceptibility mapping (QSM) that preserves fine anatomical structures without the need for manual regularization-parameter tuning. Theory: The original approximate message passing with parameter estimation (AMP-PE) framework models image gradients with a single Laplace prior, which does not fully capture the heavy-tailed gradient distribution of brain susceptibility maps. This prior mismatch can lead to over-regularization and blocky reconstructions. We address this limitation by modeling the gradients with a two-component Laplace mixture prior. Methods: We propose a Laplace-Mixture Dipole Inversion (LAMDI) method by incorporating a two-component Laplace mixture prior into the AMP-PE framework with automatic parameter estimation. LAMDI was evaluated on a public in vivo dataset. Its performance was compared with FANSI, MEDI, and AMP-PE with a single-Laplace prior (AMP-PE-L1) under both standard default and reference-tuned settings. Results: On a public multi-orientation QSM dataset, LAMDI achieved NRMSE and SSIM comparable to AMP-PE-L1 while substantially reducing HFEN, suggesting improved preservation of high-frequency anatomical detail. Under reference-based tuning, FANSI and MEDI achieved the best performance for some metrics, but LAMDI remained competitive without requiring reference maps or manual regularization tuning. Conclusion: LAMDI provides an effective and automatic parameter-estimation alternative for QSM dipole inversion by combining competitive reconstruction accuracy with improved preservation of fine anatomical detail.</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>POPSICLE: Benchmark Datasets for Segmentation and Localization in CryoET</title>
  <link>https://arxiv.org/abs/2606.10255</link>
  <pubDate>Wed, 10 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.10255v1 Announce Type: new Abstract: Cryo-electron tomography (cryoET) has emerged as a powerful tool in structural and cellular biology by enabling direct visualization of macromolecular structures within intact cells, thereby linking molecular architecture to cellular organization in a native context. Realizing the full potential of cryoET, however, increasingly depends on advances in computational analysis, particularly machine learning (ML), to interpret its complex and information-rich data. Despite rapid progress, ML development for cryoET remains bottlenecked by the lack of standardized, well-annotated benchmarks. Existing evaluations are typically small, task-specific, and are assembled in isolation, limiting robust comparisons across methods. Here, we present POPSICLE, a benchmark suite for cryoET segmentation and macromolecular localization built from the CryoET Data Portal - an open, ML-ready repository of tomographic data, metadata, and annotations. POPSICLE spans eukaryotic and prokaryotic systems, both purified and fully in situ samples, and dense voxel-wise segmentation as well as sparse localization tasks. Built on a living data resource, it can expand as new datasets and annotations become available. Baseline experiments reveal substantial variation in model rankings across tasks, underscoring the need for benchmarks tailored to the unique characteristics of cryoET rather than evaluation practices adapted from adjacent biomedical imaging domains. POPSICLE thus provides an open and extensible foundation for reproducible ML evaluation in cryoET.</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>Overlapped Wavelet Diffusion for Low-Light Image Enhancement</title>
  <link>https://arxiv.org/abs/2606.10280</link>
  <pubDate>Wed, 10 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.10280v1 Announce Type: new Abstract: In this study, we propose an overlapped wavelet diffusion framework for Low-Light Image Enhancement (LLIE), which incorporates two complementary components to achieve blocking artifact-free and detail-preserving enhancement. Although recent diffusion-based LLIE methods have demonstrated remarkable performance compared with traditional approaches, DiffLL still suffers from blocking artifacts caused by the Haar Wavelet Transform (WT) and blurred edges or over-smoothed textures due to the limitations of its High-Frequency Restoration Module (HFRM). To overcome these issues, we introduce an Overlapped WT (OWT) that incorporates correlations across neighboring regions, thereby structurally preventing blocking artifacts. Furthermore, we integrate a low-frequency-guided High-Frequency Enhance Block (HFEBlock) to strengthen detail recovery, yielding sharper edges and more reliable textures. Extensive experiments on the LOLv1 and LOLv2-real datasets demonstrate that our framework, termed OWDiff, consistently outperforms existing LLIE methods both qualitatively and quantitatively, achieving superior visual quality while maintaining computational efficiency. OWDiff effectively addresses the structural limitations of the Haar WT and the HFRM, achieving an average PSNR gain of 0.58 dB, along with a 1.64% relative improvement in SSIM and a 5.9% relative reduction in LPIPS, compared to DiffLL across both the LOLv1 and LOLv2-real datasets.</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>Unsupervised Deep Learning for Limited-Angle STEM-EDX Tomography -- Application to 3D Chemical Analysis of Phase-Change Memory Devices</title>
  <link>https://arxiv.org/abs/2606.10547</link>
  <pubDate>Wed, 10 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.10547v1 Announce Type: new Abstract: Energy Dispersive X-ray (EDX) tomography in Scanning Transmission Electron Microscopy (STEM) enables 3D compositional and elemental mapping at the nanoscale, but its use is limited by restricted tilt ranges and low-dose conditions required to avoid beam damage. Limited-angle acquisition introduces missing-wedge artefacts such as elongation and anisotropic resolution, while noisy low-dose data further degrade reconstruction quality and quantitative reliability. Here, we introduce an unsupervised deep learning framework based on Deep Image Prior with total variation regularization (DIP-TV) for limited-angle STEM-EDX tomography. We extend it to a multi-channel formulation (DIPm-TV) that jointly reconstructs multiple elemental maps by exploiting spatial correlations. Using a synthetic 3-channel phantom, we show that the method compensates for severe missing-wedge artefacts corresponding to approximately $100^\circ$ of missing angular range under moderate noise, outperforming simultaneous iterative reconstruction technique and compressed sensing approaches. We apply the method to 3D chemical analysis of Ge-Sb-Te (GST) memory devices in virgin (as-fabricated) and SET (crystalline) operational states. Samples were prepared as cross-sectional focused ion beam lamellae and acquired under a limited-angle tilt range from $-40^\circ$ to $+40^\circ$ with $5^\circ$ steps and a dose of $2.0\times10^5$ $e^-/Ang^2$. The multi-channel approach enables voxel-by-voxel elemental reconstruction using only EDX signals without external structural priors such as high-angle annular dark-field imaging. The reconstructed volumes show near-isotropic spatial resolution and reveal compositional heterogeneities associated with device operation. This approach enables 3D chemical characterization in experimentally accessible sample geometries where conventional methods fail due to severe angular limitations.</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>++nnU-Net: Scaling nnU-Net with Prefix-Based Data Augmentation</title>
  <link>https://arxiv.org/abs/2606.10713</link>
  <pubDate>Wed, 10 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.10713v1 Announce Type: new Abstract: The nnU-Net has demonstrated continuous success in medical segmentation tasks, which heavily rely on the availability and diversity of annotated biomedical data. However, assembling medical imaging cohorts remains challenging due to numerous factors such as privacy regulations and annotation costs. As a result, data augmentation plays a crucial role in increasing data availability while maintaining anatomical feasibility. Hence, we propose the ++nnU-Net, a novel data augmentation module based on image registration that operates before preprocessing and training take place. Our framework was evaluated across five different 2D datasets. In this workflow, image data go through a two-stage registration process, generating new warped images. The transformations are then applied to the respective segmentation. In addition, the pipeline computes available disk space, generates supplementary binary synthetic masks and generates checkpoints. We demonstrate that the ++nnU-Net outperforms the nnU-Net baseline, yielding improvements in Dice Similarity Coefficient scores. In the most prominent cases, we observe performance gains of approximately 22%. These findings highlight the effectiveness of registration-based data augmentation, particularly for 2D medical imaging datasets, and suggest that the ++nnU-Net provides a practical and scalable approach for enhancing segmentation performance in data-limited settings. The source code for the ++nnU-Net is available at: https://github.com/sofia-adelie/plusplusnnunet.git</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>Low-Dose 3D Bonding Mapping Through &quot;Soft&quot; Core-Loss EELS Tomography and Unsupervised Deep Learning</title>
  <link>https://arxiv.org/abs/2606.10893</link>
  <pubDate>Wed, 10 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.10893v1 Announce Type: new Abstract: Resolving the 3D chemical configuration of beam-sensitive nanomaterials at high spatial resolution remains a persistent frontier in scanning transmission electron microscopy (STEM). The main limitation lies in the trade-off between high electron dose required for analytical signals and the large number of projections needed for tomographic reconstruction. Here, we achieve dose-efficient 3D bonding mapping of FeO/Fe$_3$O$_4$ core-shell nanocubes with high resolution via electron energy loss spectroscopy (EELS). Our approach relies on two developments. First, a standardless &quot;soft&quot; core-loss EELS methodology exploiting Fe-M$_{2,3}$ edges provides ${\sim}50\times$ higher dose efficiency than conventional Fe-L$_{2,3}$ edges, using the latter only as a source of FeO and Fe$_3$O$_4$ standards. Second, we introduce multi-channel deep image prior with total variation regularization (DIPm-TV), an unsupervised method for spectroscopic tomography that jointly reconstructs multiple channels by exploiting spatial correlations under sparse-view and low-dose conditions. Using simulated datasets, high-quality reconstructions are obtained from as few as nine projections over $-70^\circ$ to $+70^\circ$, without HAADF-STEM signal or symmetry constraints. Applied to FeO/Fe$_3$O$_4$ nanocubes, Fe-M$_{2,3}$ EELS maps show improved SNR and spatial resolution, revealing a thin outer FeO shell surrounding the magnetite shell. DIPm-TV yields ${\sim}1$ nm isotropic resolution oxidation-state volumes preserving cubic morphology, recovering the outer FeO shell, and revealing a small internal void, features not accessible with conventional reconstruction methods. This work establishes a pathway for low-dose 2D and 3D analytical mapping of beam-sensitive materials using shallow core-loss edges, enabling orders-of-magnitude dose reduction while maintaining spectral fidelity and reliable 3D information.</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>Multimodal Brain Tumour Classification Using Feature Fusion</title>
  <link>https://arxiv.org/abs/2606.11107</link>
  <pubDate>Wed, 10 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.11107v1 Announce Type: new Abstract: Clinicians diagnose brain tumors by synthesizing patient symptoms, medical history, and quantitative imaging data from modalities such as MRI and CT scans into a unified clinical judgement. However, most deep learning models rely on MRI/CT images alone, failing to replicate clinicians&#39; multimodal reasoning. We explore a two-branch multimodal network combining raw MRI scans with 91 extracted radiomic features (intensity, texture, shape, and boundary descriptors) to classify brain tumors into glioma, meningioma, pituitary, and no-tumor. A pre-trained CNN backbone encodes the image stream, whereas a dedicated MLP encodes the radiomic stream. Both streams are fused via concatenation, gated, or bidirectional cross-modal attention strategies. Across nine experimental runs on a balanced 7,200-image dataset, all multimodal configurations outperform unimodal baselines, with gated fusion achieving the best accuracy of 96.13%.</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>Safecloud: A Distributed, Encrypted Storage Cloud for Streaming</title>
  <link>https://arxiv.org/abs/2606.09870</link>
  <pubDate>Wed, 10 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.09870v1 Announce Type: cross Abstract: We present Safecloud, a distributed, encrypted, self-pricing storage and streaming network whose storage and routing nodes never see plaintext and never hold keys. Each file is split into chunks, encrypted on the owner&#39;s device, and distributed across Drops (browser tabs storing ciphertext in IndexedDB) and Jets (federated routing servers). Only the owner, or an authorised grantee, can decrypt. We make five contributions: (1) A one-root key hierarchy: every key derives deterministically from a single root via HKDF, and owner and range-scoped grantee derive identical chunk keys (derivation agreement); a subtree key derives its range and nothing else (delegation containment). (2) Convergent content addressing: identical content yields identical ciphertext and identifiers, enabling deduplication without plaintext exposure, with identifiers binding authenticated ciphertext so a keyless Drop verifies integrity (blind verifiability). (3) Three parallel trees over one navigation path (Merkle for integrity, key-derivation for confidentiality, access for authorisation), with sound Merkle-verified retrieval. (4) The key tree doubles as a streaming index: a player derives each segment key in O(1), seeking by derivation, while parallel tracks (video, audio, captions) are independent subtrees unlockable per-track and per-segment, a combination we believe no prior encrypted-storage network offers. (5) Jets and Drops earn Safebux verifiably, kept honest by a one-signature proof-of-storage challenge under chilling-effect Proof-of-Corruption, a zero-sum economy that is significantly cheaper than Filecoin&#39;s proof-of-replication sealing (which is slow and provides no confidentiality). We give the architecture, cryptographic construction, a threat model, and an open-source reference implementation, stating precisely what is implemented versus designed.</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>Unleashing Correlation and Continuity for Hyperspectral Reconstruction from RGB Images</title>
  <link>https://arxiv.org/abs/2501.01481</link>
  <pubDate>Wed, 10 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2501.01481v2 Announce Type: replace Abstract: Reconstructing Hyperspectral Images (HSI) from RGB images can yield high spatial resolution HSI at a lower cost, demonstrating significant application potential. This paper reveals that local correlation and global continuity of the spectral characteristics are crucial for HSI reconstruction tasks. Therefore, we fully explore these inter-spectral relationships and propose a Correlation and Continuity Network (CCNet) for HSI reconstruction from RGB images. For the correlation of local spectrum, we introduce the Group-wise Spectral Correlation Modeling (GrSCM) module, which efficiently establishes spectral band similarity within a localized range. For the continuity of global spectrum, we design the Neighborhood-wise Spectral Continuity Modeling (NeSCM) module, which employs memory units to recursively model the progressive variation characteristics at the global level. In order to explore the inherent complementarity of these two modules, we design the Patch-wise Adaptive Fusion (PAF) module to efficiently integrate global continuity features into the spectral features in a patch-wise adaptive manner. These innovations enhance the quality of reconstructed HSI. We perform comprehensive comparison and ablation experiments on the mainstream datasets NTIRE2022 and NTIRE2020 for the spectral reconstruction task. Compared to the current advanced spectral reconstruction algorithms, our designed algorithm achieves State-Of-The-Art (SOTA) performance.</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>Cyst-X: A Multi-Center MRI Benchmark and Federated Learning Framework for Malignancy-Risk Stratification of Pancreatic Cystic Neoplasm</title>
  <link>https://arxiv.org/abs/2507.22017</link>
  <pubDate>Wed, 10 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2507.22017v4 Announce Type: replace Abstract: Pancreatic cancer is projected to be the second-deadliest cancer by 2030, making early detection critical. Intraductal papillary mucinous neoplasms (IPMNs), key cancer precursors, present a clinical dilemma, as current guidelines struggle to stratify malignancy risk, leading to unnecessary surgeries or missed diagnoses. Here, we introduce Cyst-X, a multi-center MRI benchmark and a federated learning framework for IPMN malignancy-risk stratification. The dataset comprises 1,461 abdominal MRI scans from 764 patients at seven international centers, with three-tier malignancy labels anchored in histopathology or three-year imaging follow-up and expert pancreas segmentations. The pipeline couples the PanSegNet pancreas segmenter with a 3D DenseNet-121 classifier and a parallel radiomics predictor. On internal cross-validation, the deep learning classifier reached a mean area under the receiver operating characteristic curve (AUC) of 0.85 (95% confidence interval 0.84-0.86) on T2-weighted MRI for high-risk versus low- or no-risk discrimination, with the average precision rising from a prevalence baseline of 0.23 to 0.64. This performance was preserved (AUC 0.85, FedProx) when training was distributed across institutions without exchange of raw patient images. Benchmarked against three blinded radiologists on a 629-case reader subset evaluated under imaging-only conditions, the classifier matched or exceeded sensitivity at comparable specificity. To accelerate research in early pancreatic cancer detection, we publicly release the Cyst-X dataset, segmentation masks, and trained models as the first large-scale, multi-centre MRI resource for pancreatic cystic neoplasm analysis.</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>Selective Disk Bispectrum: A Complete and Rotation Invariant Image Descriptor</title>
  <link>https://arxiv.org/abs/2511.19706</link>
  <pubDate>Wed, 10 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2511.19706v2 Announce Type: replace Abstract: Rotation invariance is a fundamental requirement across many computer vision tasks. Historically, this inductive bias has been encoded through hand-crafted rotation-invariant representations. These are compact, interpretable, and fast to compute, but they come at the cost of descriptive power. More recently, architectures achieve inductive bias through learned representations. These are highly descriptive and achieve strong empirical performance, at the cost of efficiency and interpretability. In this work, we propose an alternative at the intersection of both paradigms. We introduce the selective disk bispectrum (SDB), a complex-valued rotation-invariant vector that preserves all information about the image except its orientation. Our key theoretical contributions are the selective disk bispectrum, its inversion, its (reduced) spatial and computational complexities (compared to the full disk bispectrum), and its expectation and variance under noise. Furthermore, we propose a numerical SDB approximation and provide theoretical guarantees for its accuracy and rotation invariance. Empirically, we validate SDB&#39;s invariance and robustness to noise on classification tasks. We test our reconstruction algorithm on multi-reference alignment of rotated images.</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>Real-time 3D Visualization of Radiance Fields on Light Field Displays</title>
  <link>https://arxiv.org/abs/2508.18540</link>
  <pubDate>Wed, 10 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2508.18540v2 Announce Type: replace-cross Abstract: Radiance fields, including their recent efficient forms such as 3D Gaussian Splatting and Sparse Voxels, have revolutionized photorealistic 3D scene visualization by enabling high-fidelity reconstruction of complex environments, making them a natural match for light field displays. However, integrating these technologies presents significant computational challenges, as light field displays require many high-resolution renderings from slightly shifted viewpoints, while radiance fields rely on computationally intensive volume rendering, which struggles to reach real-time speeds even with efficient scene representations. In this paper, we propose a unified and efficient framework for real-time radiance field rendering on light field displays. Rather than re-rendering each view independently, our method converts the input radiance field into shared intermediate sweeping planes that can be efficiently composited into dense light-field views in a single pass. Our method prioritizes shared, non-directional plane caching for real-time performance, trading fine view-dependent color effects for a modest increase in intermediate memory usage. Our framework generalizes across different scene representations without retraining and avoids repeated computation across views. We further demonstrate a real-time interactive application on a Looking Glass display, achieving 200+ FPS at 512p across 45 rendered views and enabling seamless, immersive 3D interactive viewing experiences. On standard benchmarks, our method achieves up to 22x speedup compared to independently rendering each view, while largely preserving image quality.</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>Curved Beam Enabled Wireless Communications: Modeling, Analysis and Optimization</title>
  <link>https://arxiv.org/abs/2606.10164</link>
  <pubDate>Wed, 10 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.10164v1 Announce Type: new Abstract: In this paper, the problem of using curved beams to improve wireless communication performance in the presence of a blockage is studied. In particular, a transmitter equipped with a continuous aperture array can generate curved beams to serve multiple receivers by allowing signals to propagate along both straight and curved paths. To optimize the weighted sum-rate, a curved beam model is developed for controlling the beam steering, beam focusing, and beam curving functions, along with a segmented channel model to characterize practical channels induced by the blockage. Based on the introduced curved beam model, an optimization problem is posed with the goal of maximizing the weighted sum-rate of all users under a transmit power budget and physical constraints of curved beams. To solve this problem, the continuous aperture is first converted into finite summations via a discrete sampling of the continuous coordinate. Then, the performance gap between the ideal continuous aperture design and its practical discrete aperture approximation is analyzed. Based on the above discrete approximation, an iterative algorithm is developed to optimize curved beam control parameters. In particular, the original problem is reformulated into a tractable form via fractional programming (FP). Then, the transformed problem is solved by designing an enhanced block coordinate ascent (BCA) method which determines a surrogate-construction point leveraging the local descent from previous iterations, thereby accelerating convergence. Next, a proximal regularization term is incorporated into the surrogate function to control the update magnitude and suppress aggressive updates, thereby improving update stability. Finally, the beam amplitudes are computed based on the effective channel gains. Simulation results show that the proposed method can improve the weighted sum-rate compared to using only straight beams.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Optimal Illumination via Joint Movement and Phase Optimization for Movable Antenna-RIS Configuration</title>
  <link>https://arxiv.org/abs/2606.10190</link>
  <pubDate>Wed, 10 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.10190v1 Announce Type: new Abstract: Reconfigurable intelligent surfaces (RIS) enable programmable control of wireless propagation but remain vulnerable to persistent deep fades in static deployments. This paper introduces a Movable Antenna-enhanced RIS (MA-RIS) architecture where antenna elements physically reposition to sample independent spatial channels, enabling mobility-induced diversity. We model antenna motion using a Stochastic Differential Equation (SDE) framework capturing controlled drift and environmental diffusion. Itô calculus-based analysis characterizes steady-state antenna distributions, spatial decorrelation, and outage probability, revealing fundamental trade-offs between control strength and mobility randomness. To maximize long-term SNR while accounting for control overhead, we propose an overhead-aware Two-timescale framework separating slow antenna trajectory control from fast phase adaptation. The stochastic optimal control problem is solved via predictive approximation of the Hamilton-Jacobi-Bellman (HJB) formulation, enabling real-time implementation. Simulations validate theoretical predictions: the Two-timescale strategy achieves up to 36 dB steady-state SNR with remarkable stability, outperforming position-only control by up to 15 dB and uncontrolled baselines by over 30 dB. Despite experiencing a lower SNR than Active RIS, the proposed approach delivers up to 16 times higher energy efficiency (EE) across varying system scales, establishing a new paradigm of mobility-enabled channel adaptation for resilient wireless systems.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Simplified Temporal Convolutional-Based Channel Estimation for a WiFi Vehicular Communication Channel</title>
  <link>https://arxiv.org/abs/2606.10511</link>
  <pubDate>Wed, 10 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.10511v1 Announce Type: new Abstract: Channel estimation in vehicular communication is a crucial element in the advancement of intelligent transportation systems. However, the use of pilot signals in the IEEE 802.11p standard is insufficient for accurate channel estimation in high-mobility scenarios. Data pilot-aided (DPA) estimation helps address this, but suffers from demapping errors. We propose a simplified Temporal Convolutional Network-based estimator (DPA-TCN) trained on a mixed signal-to-noise ratio dataset to improve estimation performance and reduce computational complexity. Our DPA-TCN estimator achieves a bit error rate comparable to a state-of-the-art long short-term memory network with DPA and temporal averaging (LSTM-DPA-TA) while reducing the complexity of the model by approximately 65%.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Complex VAE with Heavy-Tailed Likelihood for Radar Target Detection in Sea Clutter</title>
  <link>https://arxiv.org/abs/2606.10540</link>
  <pubDate>Wed, 10 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.10540v1 Announce Type: new Abstract: To address the heavy-tailed, spike-prone nature of sea clutter and the scarcity of labeled target data, an unsupervised complex-valued variational autoencoder (VAE) for maritime radar target detection is proposed. In implementation, each complex baseband slow-time sequence is represented by its in-phase and quadrature components, and the model learns their joint reconstruction from clutter-only data. A Student-\(t\) negative log-likelihood is adopted to capture heavy-tailed reconstruction errors while reducing sensitivity to outliers during clutter learning. In addition, a time-domain amplitude error constraint is introduced to penalize slow-time magnitude mismatch in the reconstruction. At inference, reconstruction deviation is used as the detection statistic, and the decision threshold is set via an empirical quantile estimated from a clutter-only validation set to enforce a constant false-alarm rate (CFAR). Experiments on measured sea-clutter data show that detection performance is consistently improved over MF, AMF, and a real-valued \(\beta\)-VAE under CFAR constraints.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Information Bottleneck Meets Quantization: Finite Rate Analysis and Optimal Designs</title>
  <link>https://arxiv.org/abs/2606.10869</link>
  <pubDate>Wed, 10 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.10869v1 Announce Type: new Abstract: The Information Bottleneck (IB) is a well-established framework that looks for a compact latent representation of a data source, by trading rate and data-size representation for information accuracy with respect to another target data. The Gaussian IB (GIB) is its simple closed-form solution when the target is jointly Gaussian with the source. In many practical problems, however, the latent representation has to be stored or represented with a finite number of bits, whereas the optimal (G)IB solution has no such finite-bit representation. First, this manuscript theoretically analyzes the effect of scalar and vector quantization of the GIB latent representation, and its impact on the (dis)informativeness with respect to the target data. Then, task-oriented quantization designs are proposed by (jointly) reformulating the GIB optimization problem under a finite-rate constraint on the latent representation. Simulation results on MMSE regression problems confirm the effectiveness of the proposed quantization designs, which show significant gains with respect to more heuristic, or separate, quantization designs of the standard GIB latent representation. Finally, the paper extends the task-oriented philosophy to non-Gaussian settings, by properly modifying the cost function used in variational auto-encoders (VAEs) of IB-inspired vector quantizers.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Personalized Deep Learning for Short-Term Forecasting of Impending Atrial Fibrillation from Continuous Wearable ECG Signals</title>
  <link>https://arxiv.org/abs/2606.10900</link>
  <pubDate>Wed, 10 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.10900v1 Announce Type: new Abstract: Background and Objective: Continuous wearable electrocardiogram (ECG) monitoring is increasingly used for ambulatory arrhythmia surveillance, yet forecasting impending atrial fibrillation (AF) is challenged by inter-patient ECG variability. This study investigated whether personalizing a global model via fine-tuning on an individual&#39;s ECG signals improves short-term forecasting of impending AF. Methods: A global model trained on the ICENTIA11K dataset was compared against personalized models fine-tuned across three cohorts: ICENTIA11K, IRIDIA-AF, and MobiCARE. Following preprocessing, models processed 60-second ECG segments for a five-minute forecast horizon. We evaluated the impact of adaptation data volume and analyzed ECG features, such as heart rate and RMSSD. Results: Personalized models significantly outperformed the global model, achieving AUROCs of 0.711 vs. 0.614 in ICENTIA11K and 0.686 vs. 0.585 in MobiCARE. Personalization benefits increased with the amount of patient-specific fine-tuning data. While the global model&#39;s accuracy rose as AF onset approached, personalized models in the two external cohorts exhibited distinct temporal dynamics, which may indicate the capture of patient-specific characteristics less dependent on proximity to the AF event. Pre-AF episodes showed elevated heart rates and RMSSD. Feature attributions highlighted clinically relevant precursors, including frequent premature atrial complexes (PACs) and short supraventricular tachycardias (SVTs). Conclusions: Adapting deep learning models with patient-specific wearable ECG data significantly enhances short-term forecasting of impending AF. This personalized framework supports timely preventive interventions and improved AF management in ambulatory monitoring environments.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Multi-Channel Soil Moisture Measurement: High Accuracy and Low Crosstalk Through Optical-Semiconductor Based Differential Sensing</title>
  <link>https://arxiv.org/abs/2606.11020</link>
  <pubDate>Wed, 10 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.11020v1 Announce Type: new Abstract: Soil moisture measurement plays a key role in irrigation and environmental management. Yet it remains unreliable due to heterogeneous soils, limited sensing volumes, temperature drift, and parasitic inter-channel coupling. This work presents a compact multi-depth capacitive probe that extends a parallel-plate geometry from previous work with differential activation to suppress stray capacitances and improve accuracy. An equivalent-circuit model quantifies parasitic effects, and optically coupled transistor bridges isolate each sensing layer. Raw capacitance is converted to volumetric water content and plant-available water using established calibration models. Laboratory results show a fourfold reduction in temperature sensitivity, strong confinement of the sensing volume, and improved repeatability in heterogeneous soils. Field validation against reference sensors demonstrates high accuracy and precision comparable to widely used instruments, enabling a practical and scalable solution for agricultural and urban soil-moisture monitoring.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>DMT: Demographic Conditioning, Morphology-Enhanced Transformer for Cuffless Blood Pressure Estimation from PPG Signals</title>
  <link>https://arxiv.org/abs/2606.11125</link>
  <pubDate>Wed, 10 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.11125v1 Announce Type: new Abstract: Blood pressure (BP) is a key marker for cardiovascular risk assessment and therapeutic decision-making, and Photoplethysmography (PPG) enables low-cost, wearable-friendly cuffless BP estimation. However, even with recent progress, many PPG-based models are trained with BP regression alone and may rely on amplitude-dominated shortcuts. In addition, demographic covariates that systematically modulate vascular compliance are often incorporated only via late fusion, limiting subject-specific representation learning. We propose a Transformer-based network for cuffless BP estimation from PPG signal, leveraging self-attention to capture long-range dependencies across multiple cardiac cycles. To account for subject-specific vascular differences, the model is conditioned on demographics via FiLM-style feature modulation applied through the attention and feed-forward sublayers of Transformer blocks. In addition, we add an auxiliary morphology head to guide the model to attend to BP-relevant waveform morphology associated with arterial stiffness and wave reflection. Under calibration-based evaluation protocols on the large-scale PulseDB dataset, the proposed method achieves MAE of 4.56 mmHg for systolic BP and 2.62 mmHg for diastolic BP, reducing errors by 47% and 50% compared with prior demographic-enhanced PPG baselines. The resulting lightweight, single-sensor model supports scalable and clinically grounded cuffless BP estimation in calibration-enabled deployment settings.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Pre-Fault Voltage Discrimination and Time-Domain Protection for Distribution Networks with Inverter-Based Resources</title>
  <link>https://arxiv.org/abs/2606.11135</link>
  <pubDate>Wed, 10 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.11135v1 Announce Type: new Abstract: The increasing proliferation of inverter-based resources (IBRs) in distribution networks is presenting a major challenge for phasor-based overcurrent protection. This challenge stems from IBRs&#39; lack of short-circuit current sourcing capacity. As a result, traditional overcurrent protection functions (e.g., ANSI 51) are inadequate in such scenarios, and warrant alternative approaches. Time-domain protection, for example, shows promise in overcoming this challenge. In this paper we propose a pre-fault voltage discrimination (PVD) strategy whose role is to detect faults and discriminate normal switching and transformer inrush disturbances from actual faults. The use of PVD allows for the design of a simple, yet effective fault detection algorithm by using time-domain protection principles for distribution networks containing IBRs. The introduction of PVD provides for faster fault detection without reducing security and dependability. Offline simulation experiments and controller hardware-in-the-loop real-time simulation validate the effectiveness of the proposed algorithm against various fault and normal switching events.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Structured Adaptive Tensor Prediction for Streaming Data</title>
  <link>https://arxiv.org/abs/2606.10085</link>
  <pubDate>Wed, 10 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.10085v1 Announce Type: cross Abstract: Matrix-valued time series arise in a wide range of applications, such as spatio-temporal data from medical imaging and geophysics. Existing methods are mainly designed for static settings and lack adaptability to streaming and time-varying environments. Adaptive filtering techniques have also been largely limited to data with scalar or vector values, leaving adaptive forecasting for matrix-valued time series inadequately understood. To bridge these gaps, we develop an adaptive tensor regression framework that includes Matrix-on-Matrix (MoM) and Tensor-on-Matrix (ToM) formulations for streaming matrix-valued prediction. The two formulations differ in whether to directly model matrix-valued outputs or to exploit temporal structure via higher-order tensor representations. For the proposed tensor regression framework, we develop stochastic gradient descent (SGD) algorithms for online learning. We show that stacking multiple responses across time into higher-order tensors improves performance; in particular, the ToM achieves lower steady-state error and stronger denoising capability than MoM, motivating our focus on the ToM model. We further characterize the tracking behavior of SGD under time-varying dynamics. From a statistical perspective, we establish fixed-time recovery guarantees for ToM under general low-dimensional structures, including sparsity, low-rankness, and their joint sparse low-rank models.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>A Comprehensive Inference-Time Augmentation Framework in Physiological Signals: Application to PPG-Based AF Detection</title>
  <link>https://arxiv.org/abs/2606.10410</link>
  <pubDate>Wed, 10 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.10410v1 Announce Type: cross Abstract: Objective: Accurate classification of physiological signals in real-world deployments is challenged by sensor noise, motion artifacts, and distribution shifts between training and deployment data. Inference-time augmentation (ITA), which applies augmentations during inference rather than retraining, offers a simple, model-agnostic mechanism to improve robustness. However, ITA application to physiological signals has remained narrow in scope, relying on limited augmentation methods with fixed, unoptimized parameters. This work proposes a unified ITA framework to address that gap. Approach: The framework incorporates 13 augmentation methods spanning time-domain, amplitude-domain, frequency-domain, and artifact-injection transformations, with hyperparameters optimized via Bayesian optimization. We evaluate on atrial fibrillation (AF) detection from 30-second PPG signals using GPT-PPG and ResNet across five datasets comprising more than 400 patients and ~9,800 hours of recording. Main results: Standard ITA consistently improved AUROC (up to 8.5% for GPT-PPG and 0.7% for ResNet) and AUPRC (up to 10.6% for GPT-PPG and 0.8% for ResNet). Selective ITA further reduced average FPR by up to 4.4% (GPT-PPG) and 1.3% (ResNet) on non-AF datasets. Significance: These findings establish ITA as a practical, model-agnostic approach for improving PPG-based AF classification reliability in deployment settings where retraining is not feasible, with broader applicability to physiological signal analysis.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Learning Doubly Sparse Explicitly Conditioned Transforms</title>
  <link>https://arxiv.org/abs/2606.10975</link>
  <pubDate>Wed, 10 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.10975v1 Announce Type: cross Abstract: Finding convenient spaces in which certain hypotheses regarding an assumed sparse structure of natural signals hold true has become a desirable result in recent research, its implications being reflected in areas such as data compression, noise reduction and feature extraction. While the extensively used analytical transforms, such as DFT or DCT, already provide efficient algorithms and robust sparse representations, they assume a fixed prior about the data, failing to accurately capture the specific structure of more restrictive classes of signals. To address this, the concept of a data-adaptive, learnt transform has been introduced in the literature, allowing for the reduction of a residual term in the transform domain. More recent studies have shown that the condition number serves as a good metric in this context, where the desired outcome alternates between a generalizing tendency and one that achieves minimal approximation error. Motivated by these considerations, we introduce the learning of a structured, explicitly conditioned transform formulated as the product of a fixed canonical matrix and a refining data-adaptive sparse component. This approach seeks to preserve the advantages of fast and stable analytical transforms, while introducing controllable adaptivity to the data. No references that concern this specific formulation have been identified so far, indicating its novelty. The proposed algorithm is motivated within the framework of inexact proximal methods, leveraging a newly derived closed-form projection operator. Empirical observations demonstrate state-of-the-art results on the doubly sparse transform learning problem and comparable performance with its dense variant at significantly lower computational costs and sometimes faster convergence and better avoidance of bad local minima.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Federated Learning Enhanced by Feature Reconstruction for Semantic Communication Module Updates of Agents</title>
  <link>https://arxiv.org/abs/2508.03248</link>
  <pubDate>Wed, 10 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2508.03248v3 Announce Type: replace Abstract: Recent advancements in semantic communication have primarily focused on image transmission, where neural network-based joint source-channel coding modules play a central role. However, such systems often experience semantic communication errors due to mismatched knowledge bases between agents and performance degradation from outdated models, necessitating regular model updates. To address these challenges in vector quantization (VQ)-based image semantic communication systems, we propose FedSFR, a novel federated learning framework that incorporates semantic feature reconstruction (FR). FedSFR introduces an FR step at the parameter server and allows a subset of clients to transmit compact feature vectors in lieu of sending full local model updates, thereby improving training stability and communication efficiency. To enable effective FR learning, we design a loss function tailored for VQ-based image semantic communication and demonstrate its validity as a surrogate for image reconstruction error. We further establish a rigorous convergence analysis of FedSFR. Experimental results on two benchmark datasets validate the superiority of FedSFR over existing baselines, especially in capacity-constrained settings, confirming both its effectiveness and robustness.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Delay-Doppler Domain Channel Measurements and Modeling in High-Speed Railways</title>
  <link>https://arxiv.org/abs/2509.25854</link>
  <pubDate>Wed, 10 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2509.25854v2 Announce Type: replace Abstract: As next-generation wireless communication systems need to be able to operate in high-frequency bands and high-mobility scenarios, delay-Doppler (DD) domain multicarrier (DDMC) modulation schemes, such as orthogonal time frequency space (OTFS), demonstrate superior reliability over orthogonal frequency division multiplexing (OFDM). Accurate DD domain channel modeling is essential for DDMC system design. However, since traditional channel modeling approaches are mainly confined to time, frequency, and space domains, the principles of DD domain channel modeling remain poorly studied. To address this issue, we propose a systematic DD domain channel measurement and modeling methodology in high-speed railway (HSR) scenarios. First, we design a DD domain channel measurement method based on the long-term evolution for railway (LTE-R) system. Second, for DD domain channel modeling, we investigate the quasi-stationary interval, statistical power modeling of multipath components, and particularly, the quasi-invariant intervals of DD domain channel fading coefficients. Third, via LTE-R measurements at 371 km/h, taking the quasi-stationary interval as the decision criterion, we establish DD domain channel models under different channel time-varying conditions in HSR scenarios. Fourth, the accuracy of the proposed DD domain channel models is validated via bit error rate comparison of OTFS transmission. In addition, simulation verifies that in HSR scenarios, the quasi-invariant interval of the DD domain channel fading coefficient is on the order of milliseconds (ms), much smaller than the quasi-stationary interval length, which is on the order of 100 ms. This study could provide theoretical guidance for DD domain modeling in high-mobility environments, supporting future DDMC and integrated sensing and communication designs for 6G and beyond.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Gridless Full-Space DOA Estimation for STAR-RIS-Assisted Wireless Systems</title>
  <link>https://arxiv.org/abs/2602.02893</link>
  <pubDate>Wed, 10 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2602.02893v2 Announce Type: replace Abstract: Simultaneously transmitting and reflecting reconfigurable intelligent surfaces (STAR-RIS) enable full-space (0° to 360°) signal coverage, making them a compelling platform for integrated sensing and communication in next-generation wireless networks. In this paper, we investigate gridless direction-of-arrival (DOA) estimation across the full spatial domain in STAR-RIS-assisted systems operating with a single RF sensing chain. We show that the coupled reflection-transmission mechanism of STAR-RIS induces a multichannel finite-rate-of-innovation (FRI) structure in the received signal, which enables casting DOA estimation as a structured low-rank recovery problem without angular grid discretization. Building on this observation, we develop a proximal gradient descent algorithm with alternating projections onto a block-Hankel matrix set, enabling robust angle retrieval from limited measurements. Two practically relevant STAR-RIS configurations are addressed: element-wise uniform and nonuniform energy-splitting designs, each handled through a dedicated lifting strategy that preserves the underlying algebraic structure. A Ziv-Zakai bound is derived for the coupled full-space sensing model as a performance benchmark across the full SNR range. Numerical results show that the proposed methods consistently outperform grid-based baselines, achieving sub-degree accuracy within ±60° of boresight at comparable or lower computational cost.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Towards 6G Single-Anchor Vehicle Localization Exploiting Radio-Reflective Road Markings in Tunnel Environments</title>
  <link>https://arxiv.org/abs/2604.04217</link>
  <pubDate>Wed, 10 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2604.04217v2 Announce Type: replace Abstract: Accurate vehicular localization remains a key challenge for cooperative intelligent transport systems (C-ITS), especially in areas without global navigation satellite system (GNSS) coverage, such as road tunnels. This paper proposes a novel vehicle positioning method with a single anchor equipped with multiple antennas, exploiting near-field (NF) propagation and passive radio-reflective structures deployed along the GNSS-denied tunnel. The method assumes a wideband vehicle-to-everything (V2X) communication between the vehicle and the anchor, in line with the ongoing standardization of cellular V2X beyond 5G. We first derive the validity condition that allows us to approximate the multipath channel with a single reflector point, defining a geometry validity bound on the number of antennas that can be employed. Building on this result, we propose JAVELIN, a 6G-compatible single-anchor localization framework that leverages tensor-based NF parameter estimation, adaptive NF/far-field (FF) processing, and recursive Bayesian tracking to enable sub-meter positioning without multi-anchor synchronization. The method integrates angle, delay difference, and curvature measurements into a variable-dimension extended Kalman filter with gated nearest-neighbor association, enabling operation without prior environmental knowledge. Radio-reflective road markings are further introduced to enhance geometric diversity. Simulation results in realistic tunnel scenarios demonstrate accurate and robust localization under different conditions, outperforming state-of-the-art single-anchor approaches and benefiting from passive reflector deployment.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Stability Analysis for Autoregressive Sampling Sets</title>
  <link>https://arxiv.org/abs/2606.03942</link>
  <pubDate>Wed, 10 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.03942v2 Announce Type: replace Abstract: Motivated by recent developments in stochastic modeling of clock jitter in Analog-to-Digital Converters (ADCs) as autoregressive processes of order one (AR(1)), we study the density and stability properties of AR(1)-jittered sampling sets for Paley-Wiener signals. We show that, despite having the correct asymptotic density both on average and almost surely, such sets almost surely fail to be stable sampling sets. We complement this negative result with a finite-dimensional analysis, showing that the corresponding jittered sinc matrices are nonetheless well-conditioned with high probability.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Symmetry-Aware Convex Shrinkage for High-Dimensional Covariance Estimation</title>
  <link>https://arxiv.org/abs/2605.17111</link>
  <pubDate>Wed, 10 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2605.17111v2 Announce Type: replace-cross Abstract: We develop a class of data-adaptive shrinkage estimators for high-dimensional covariance estimation in which the shrinkage target is a Reynolds projection of the sample covariance under a finite symmetry group selected from a candidate library by held-out predictive performance. The class generalizes the convex shrinkage estimator of Ledoit and Wolf by replacing the scalar-identity target with a structured target derived from a symmetry group when one is available, and generalizes the group-symmetric maximum-likelihood estimator of Shah and Chandrasekaran by combining structural targeting with adaptive convex shrinkage and by selecting the group from data rather than treating it as prespecified. A two-tier procedure performs the group selection: a universal per-candidate evaluation based on held-out negative log-likelihood, optionally preceded by a domain-specific step that constructs the candidate library from structural priors. We establish a finite-sample regret bound for the held-out calibration of the convex combination weight, an oracle inequality for the data-driven group selection, and a quantitative sufficient-match condition under which the proposed estimator dominates Ledoit-Wolf shrinkage in Frobenius mean-squared error. The procedure is illustrated on six real-data problems spanning finance (S&amp;P 500 daily returns), climate (NOAA OISST sea-surface temperature anomalies), genomics (TCGA-BRCA gene expression), radio signal processing (RadioML 2018.A), astronomical imaging (Galaxy10 DECaLS), and natural image patches (CIFAR-10 with a CIFAR-10.1 distribution-shift companion). An empirical comparison is also made against the Bayesian permutation-symmetry estimator of Chojecki and colleagues. Outside the few-shot regime, where structural priors carry the most information per observation, Ledoit-Wolf shrinkage remains the appropriate baseline.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>A Lumped RC Equivalent Circuit of Head Tissues for Dispersive Neuro-Electromagnetic Modeling</title>
  <link>https://arxiv.org/abs/2605.29996</link>
  <pubDate>Wed, 10 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2605.29996v2 Announce Type: replace-cross Abstract: Accurate modeling of electric potential and current distribution in head tissues is crucial for the design and evaluation of neuro-sensing and neuro-stimulation systems operating in the sub-megahertz frequency range. Numerical methods are widely employed in electromagnetic simulations; however, their computational cost can limit their applicability to rapid prototyping, real-time simulations, and circuit-level integration. In this work, we introduce a lumped RC equivalent circuit model that reproduces the electrical behavior of a canonical three-layer spherical head geometry over a frequency range up to 50 kHz. The model accounts for frequency-dependent tissue conductivity and permittivity to capture dispersive effects, employing complex conductivity in the electro-quasi-static (EQS) regime. The circuit topology uses a minimal set of impedance elements in order to represent the essential mechanisms of electric signal propagation. Validation was performed using a dipolar brain source configuration for scalp voltage peak estimation, showing close agreement with semi-analytical solutions across different skull thicknesses and dipole eccentricities. In addition, the impact of tissue dispersion and capacitive branches on the model predictions was quantitatively assessed, showing their contribution to the overall fidelity of the proposed approach.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Fundamentals of NOMA in Low-Earth Orbit Coordinated Multi-Satellite Networks</title>
  <link>https://arxiv.org/abs/2606.10301</link>
  <pubDate>Wed, 10 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.10301v1 Announce Type: new Abstract: Coordinated multi-satellite (CoMS) transmission and non-orthogonal multiple access (NOMA) are envisioned to jointly enhance coverage, capacity, and spectrum efficiency for satellite networks. Their integration into a unified CoMS-NOMA framework will allow more efficient, reliable, and energy-efficient multi-user access. This paper investigates the downlink performance of CoMS-NOMA networks from a system-level perspective, in which multiple satellites cooperatively serve multiple users via NOMA. Leveraging tools from stochastic geometry, related angles and distances in CoMS-NOMA are first derived as intermediate results. Then, we obtain the combined signal power distributions and analyze coverage and spectrum performance under both inter- and intra-satellite interference, accounting for potential imperfect successive interference cancellation (SIC). The analytical model is validated across a range of system parameters, including the number of satellites, service region angle, error-propagation factor, and power allocation coefficients. Numerical results indicate that increasing the number of cooperative satellites does not always improve coverage and spectrum efficiency. Additionally, while a higher main-lobe gain improves coverage, a near-perfect SIC provides only slightly greater benefits than a reasonably good SIC. With properly selected power allocation coefficients, CoMS-NOMA achieves up to a 270% improvement in coverage and a 56% gain in sum spectral efficiency, compared with conventional orthogonal and single-satellite schemes, indicating potential for green, energy-efficient satellite networking.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Towards Paradigm-General Suicide Risk Detection via Speech LLM</title>
  <link>https://arxiv.org/abs/2509.22153</link>
  <pubDate>Wed, 10 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2509.22153v2 Announce Type: replace Abstract: Suicide risk among adolescents remains a critical public health concern, and speech provides a non-invasive and scalable approach for its detection. Speech-based suicide risk assessment commonly relies on carefully designed speech elicitation paradigms (e.g., verbal fluency, reading, or question answering) to probe cognitive and affective states. Existing approaches, however, typically focus on one single paradigm at a time. This paper, for the first time, investigates cross-paradigm approaches that unify diverse speech elicitation paradigms within a single model. Specifically, we use a speech LLM as backbone with a mixture of DoRA experts (MoDE) to capture complementary cues across assessments dynamically, tested on 1,223 participants across ten speech elicitation paradigms. Results show that MoDE outperforms both paradigm-specific and conventional joint-learning models. Moreover, it can generalise to unseen paradigms and provide better confidence calibration.</description>
  <dc:source>Systems/eess.AS_(Audio_and_Speech_Processing)</dc:source>
</item>
<item>
  <title>Assessment of Personality Dimensions Across Situations in Dyadic Role-Play Scenarios</title>
  <link>https://arxiv.org/abs/2507.19137</link>
  <pubDate>Wed, 10 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2507.19137v2 Announce Type: replace Abstract: Prior research indicates that users prefer assistive technologies whose personalities align with their own. This has sparked interest in automatic personality perception (APP), which aims to predict an individual&#39;s perceived personality traits. Previous studies in APP have treated personalities as static traits, independent of context. However, perceived personalities can vary by context and situation as shown in psychological research. In this study, we investigate the relationship between conversational speech and perceived personality for participants engaged in two work situations (a neutral interview and a stressful client interaction). Our key findings are: 1) perceived personalities differ significantly across interactions, 2) loudness, sound level, and spectral flux features are indicative of perceived extraversion, agreeableness, conscientiousness, and openness in neutral interactions, while neuroticism correlates with these features in stressful contexts, 3) handcrafted acoustic features and non-verbal features outperform speaker embeddings in inference of perceived personality, and 4) stressful interactions are more predictive of neuroticism, aligning with existing psychological research.</description>
  <dc:source>Systems/eess.AS_(Audio_and_Speech_Processing)</dc:source>
</item>
<item>
  <title>Multi-Faceted Interactivity Alignment in Full-Duplex Speech Models</title>
  <link>https://arxiv.org/abs/2606.11167</link>
  <pubDate>Wed, 10 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.11167v1 Announce Type: cross Abstract: Full-duplex spoken dialogue models can listen and speak simultaneously, making them a promising architecture for natural conversation. However, current models are trained solely with supervised learning through token-level likelihood maximization, which does not directly optimize interaction-level behaviors, causing interactivity issues such as excessive silence and ill-timed turn-taking. Recent work has applied reinforcement learning (RL) to improve interactivity, but existing methods address only a limited set of interactive behaviors in their rewards. In this work, we propose a post-training alignment method that comprehensively improves the interactivity of full-duplex spoken dialogue models through RL. We address the four canonical axes of interactivity: pause handling, turn-taking, backchanneling, and user interruption. For each axis, we extract short audio segments from human conversation corpora and optimize the model with axis-specific reward functions. An extra LLM-based reward for response quality prevents semantic degradation. We apply our method to two open-source models, Moshi and PersonaPlex, demonstrating consistent improvements in interactivity on both offline evaluation with pre-recorded audio and real-time multi-turn dialogue evaluation.</description>
  <dc:source>Systems/eess.AS_(Audio_and_Speech_Processing)</dc:source>
</item>
<item>
  <title>Data-Driven Runway and Taxiway Exits Prediction of Landing Aircraft: A Case Study at Hartsfield-Jackson Atlanta International Airport</title>
  <link>https://arxiv.org/abs/2606.11017</link>
  <pubDate>Wed, 10 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.11017v1 Announce Type: cross Abstract: Airport surface operations increasingly constrain performance at high-throughput hubs. This study examines arrival taxi-in decisions at Hartsfield-Jackson Atlanta International Airport (KATL) and proposes a two-stage, data-driven decision aid that mirrors controller workflow. Stage I predicts the runway exit selected by an arriving aircraft. Stage II predicts whether, given that exit, the aircraft will cross the active departure runway at a designated point or use the end-around taxiway. Models are trained using ASDE-X surface trajectories, aircraft characteristics, ramp destinations, short-horizon traffic rates, and weather across multiple look-back windows. We benchmark nine classifiers, including Random Forest, XGBoost, LightGBM, and CatBoost, and evaluate accuracy, macro-F1, precision-recall behavior, confusion matrices, Brier score, and Expected Calibration Error. Across east and west flows, XGBoost and LightGBM outperform Random Forest. Stage I achieves 0.86-0.89 accuracy with macro-F1 scores of 0.40-0.50, while Stage II achieves 0.70-0.74 accuracy with macro-F1 scores of 0.28-0.55. Feature-importance analysis shows that approach speed is the main driver of exit choice. Departure rate, crossing rate, ramp destination, and, for west flow, the selected exit are the strongest predictors of crossing versus end-around routing. Minority classes remain harder to predict because of feature-space overlap, as shown by t-SNE and UMAP analyses. The proposed framework supports controller situational awareness through calibrated, explainable predictions while preserving human responsibility for final routing decisions.</description>
  <dc:source>Systems/eess.AS_(Audio_and_Speech_Processing)</dc:source>
</item>
<item>
  <title>Multilingual Word-Level Forced Alignment with Self-Supervised Representations and Learned Dynamic Programming</title>
  <link>https://arxiv.org/abs/2606.10675</link>
  <pubDate>Wed, 10 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.10675v1 Announce Type: cross Abstract: We present a method for accurate multilingual word-level forced alignment, consisting of an alignment encoder and a learned alignment decoder. The encoder integrates two representations: one from the Massively Multilingual Speech (MMS) model and another from a self-supervised phoneme boundary detector (UnSupSeg). It learns to fuse them and to estimate word-boundary probabilities over long temporal contexts. The alignment decoder is a learned dynamic programming procedure that combines encoder outputs with segmental features over the MMS and UnSupSeg representations to infer final word boundaries. Trained iteratively on TIMIT and Buckeye, the proposed approach outperforms Montreal Forced Aligner (MFA) and MMS-based alignment on both datasets. On unseen languages (Dutch, German, and Hebrew), the proposed model achieves performance consistently better than or on par with existing alignment approaches, indicating its potential to scale to 1100+ languages supported by MMS without further training.</description>
  <dc:source>Systems/eess.AS_(Audio_and_Speech_Processing)</dc:source>
</item>
<item>
  <title>ParaBridge: Bridging Paralinguistic Perception and Dialogue Behavior in Speech Language Models</title>
  <link>https://arxiv.org/abs/2606.10581</link>
  <pubDate>Wed, 10 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.10581v1 Announce Type: cross Abstract: Speech carries more information than just words: a child&#39;s voice, a fearful tone, or a noisy background should all lead a sufficiently competent spoken-dialogue assistant to different replies. Current Speech Language Models (SLMs) can recognize such paralinguistic cues but often ignore them in open-ended dialogue. We observe that a simple paralinguistic instruction scaffold at the inference stage narrows this perception-behavior gap, suggesting that the relevant cues are already latent in the model. Such scaffolds, however, remain brittle under multi-turn context and competing instructions. Therefore, we propose ParaBridge, an on-policy self-distillation method that turns a brittle inference-time scaffold into stable model behavior. During training, the scaffold serves only as a temporary privileged view; the scaffold-free model rolls out its own response, while the scaffolded view supplies dense, full-vocabulary next-token targets along its trajectory. This supervision teaches when non-lexical cues should affect the reply without the need for curated dialogues, human labels, or external reward models. On Qwen3-Omni-thinking, ParaBridge raises scaffold-free VoxSafeBench SAR from 14.6% to 40.3% and improves EchoMind average rating from 3.27 to 3.92. It also preserves general ability, with MMAU-Pro, VoiceBench, and GPQA all within 0.4 points of the original model. Beyond the training distribution, ParaBridge generalizes to unseen paralinguistic cues, transfers from safety-oriented training to empathy-oriented dialogue, and works on a different SLM backbone.</description>
  <dc:source>Systems/eess.AS_(Audio_and_Speech_Processing)</dc:source>
</item>
<item>
  <title>A Lightweight Dual-Factor Acoustic Authentication System via Cascaded GMM-DTW Architecture for Edge Computing</title>
  <link>https://arxiv.org/abs/2606.10565</link>
  <pubDate>Wed, 10 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.10565v1 Announce Type: cross Abstract: This paper presents a lightweight, cascaded GMM-DTW dual-factor voice lock system for resource-constrained edge environments. By utilizing a shared MFCC feature space, the framework implements a sequential defense mechanism combining GMM speaker screening and DTW passphrase verification. To counter presentation threats without extra hardware, a dynamic joint absolute-relative margin constraint is integrated into the GMM classification space, limiting the physical imposter and high-fidelity replay attack False Acceptance Rates (FAR) to 2.73% and 6.67%, respectively, with a legitimate False Rejection Rate (FRR) of 16.67%. Due to Sakoe-Chiba window optimization, the global end-to-end processing latency under temporal stress is rigidly bounded at 9.82ms on a single-core CPU, comprising 1.51ms for feature extraction, 0.54ms for GMM scoring, and 7.77ms for worst-case DTW matching. These empirical benchmarks demonstrate the viability of white-box acoustic cascades for secure, deterministic real-time deployment on low-power edge nodes.</description>
  <dc:source>Systems/eess.AS_(Audio_and_Speech_Processing)</dc:source>
</item>
<item>
  <title>Enhancing Multilingual LLM-based ASR with Mixture of Experts and Dynamic Downsampling</title>
  <link>https://arxiv.org/abs/2606.10439</link>
  <pubDate>Wed, 10 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.10439v1 Announce Type: cross Abstract: The rapid progress of large language models (LLMs) has opened up a new frontier for automatic speech recognition (ASR), making their effective integration a critical and challenging research direction. To this end, this work proposes a projector-based LLM-ASR framework targeting the key challenges of multilingual generalization and modality alignment. Our approach incorporates a Mixture of Experts (MoE) architecture to improve cross-lingual adaptability, and a Continuous Integrate-and-Fire (CIF) mechanism for dynamic downsampling and modality alignment. Experimental results show that the combination of these components yields substantial performance improvements, surpassing strong baseline models. The proposed method represents a step toward building more accurate, robust, and generalizable LLM-based ASR systems.</description>
  <dc:source>Systems/eess.AS_(Audio_and_Speech_Processing)</dc:source>
</item>
<item>
  <title>Optimizing 2D Input Representations and Sub-phase Fusion Strategies for Differential Diagnosis of Asthma and COPD Using CNN- and GRU-Based Networks</title>
  <link>https://arxiv.org/abs/2606.10972</link>
  <pubDate>Wed, 10 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.10972v1 Announce Type: new Abstract: This study aims to explore the performance of the VAR model in comparison with mel-frequency cepstral coefficient (MFCC) matrices and log-mel spectrograms using deep learning. In pulmonary sound classification, spectrogram-based representations suffer from inconsistent temporal dimensions due to varying respiratory cycle durations. Along with traditional trimming/zero-padding, adaptive-length windowing was presented to fix their temporal dimensions. Their spectral and temporal dimensions were optimized by testing a range of parameters. Different convolutional neural network (CNN) architectures were employed to extract features from the two-dimensional representations obtained over the sub-phases. The extracted sub-phase features were then fused using various strategies including direct concatenation, gated recurrent unit (GRU) network and GRU with attention mechanism. Model performances were assessed through respiratory cycle-based evaluation and subject-based evaluation comprising multiple respiratory cycles. Several data augmentation techniques were also studied to cope with limitations in data size. The best cycle-based F1-score (0.877) was obtained using the MFCC matrices with thirteen coefficients and 64-point time resolution per sub-phase representation followed by direct feature concatenation, and the best subject-based F1-score (0.855) was obtained using the MFCC matrices with thirteen coefficients and 256-point time resolution per full-cycle representation, both obtained by adaptive-length windowing. Augmentation degraded the performance of models overall, yet mixup augmentation was the best among the methods tested. MFCC outperformed log-mel spectrogram and VAR model in differentiation of asthma and COPD. Sophisticated fusion strategies did not improve the diagnosis. Augmentation did not contribute, demonstrating the significance of authentic data in pulmonary sound studies.</description>
  <dc:source>Systems/eess.AS_(Audio_and_Speech_Processing)</dc:source>
</item>
<item>
  <title>Phoneme-First Prediction for LLM-Based Speech Recognition</title>
  <link>https://arxiv.org/abs/2606.10864</link>
  <pubDate>Wed, 10 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.10864v1 Announce Type: new Abstract: Recent research has explored integrating Large Language Models (LLMs) with speech encoders to create speech-augmented LLMs capable of contextualized speech recognition. The main challenge lies in aligning the semantic embeddings of LLMs with the acoustic representations of speech encoders. We propose a novel approach that teaches the LLM to first predict phonemes from the speech features before generating the final transcript. By integrating a phoneme prediction step directly into the LLM, the model develops a fine-grained knowledge of pronunciation, reducing acoustic confusion and improving transcription accuracy and explainability. Our method is cheap and simple, as phoneme targets can be automatically derived from existing transcripts. Through comprehensive experiments, we show that intermediate phoneme prediction can improve speech recognition, particularly in low-resource settings, and yields outputs that are acoustically more faithful to the speech.</description>
  <dc:source>Systems/eess.AS_(Audio_and_Speech_Processing)</dc:source>
</item>
<item>
  <title>Speech Encoder Fusion for LLM-based Automatic Speech Recognition</title>
  <link>https://arxiv.org/abs/2606.10853</link>
  <pubDate>Wed, 10 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.10853v1 Announce Type: new Abstract: Speech-aware large language models (LLMs) can incorporate speech through pre-trained acoustic encoders that project speech features into the LLM embedding space. While the choice of the speech encoder critically influences performance, different encoders often exhibit complementary strengths, motivating their combination. In this work, we investigate whether fusing multiple pre-trained speech encoders can enhance speech-aware LLMs for automatic speech recognition (ASR). We explore several fusion strategies beyond simple feature concatenation, including learned combinations and Transformer-based fusion architectures, and evaluate them across mono- and multilingual ASR settings as well as diarized speech recognition. Our results indicate that carefully fusing multiple parallel speech encoders improves downstream performance in all scenarios with limited computational overhead.</description>
  <dc:source>Systems/eess.AS_(Audio_and_Speech_Processing)</dc:source>
</item>
<item>
  <title>Towards Deep Contextual Reasoning from Broad Descriptions for ASR with Speech-LLM via Metadata-Driven Reasoning Chains</title>
  <link>https://arxiv.org/abs/2606.10838</link>
  <pubDate>Wed, 10 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.10838v1 Announce Type: new Abstract: Speech recognition often fails on rare, domain-specific terms and context-related named entities. Existing contextualization techniques typically bias decoding with keywords or phrase lists, which does not scale well or exploit deeper knowledge. We propose a training method that teaches a speech-LLM to use broad descriptions (e.g. from videos) as weak semantic priors to perform contextual reasoning grounded in the audio. We build 400 hours of reasoning-augmented speech data by pairing erroneous hypotheses with video metadata and LLM-generated reasoning explanations that justify context-driven corrections. We finetune the speech-LLM to perform chain-of-thought reasoning: generate an initial transcript, then reason over the context, and finally return a corrected transcript. On held-out YouTube-derived test sets, our approach reduces errors, with specific improvements on rare words and named entities, and lays groundwork for deeper contextual reasoning in speech recognition.</description>
  <dc:source>Systems/eess.AS_(Audio_and_Speech_Processing)</dc:source>
</item>
<item>
  <title>Recovering the Zipfian Distribution in Unsupervised Term Discovery</title>
  <link>https://arxiv.org/abs/2606.10781</link>
  <pubDate>Wed, 10 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.10781v1 Announce Type: new Abstract: Unsupervised term discovery involves segmenting unlabelled speech into word- or syllable-like units and clustering these into a lexicon of candidate types. True lexicons follow a Zipfian distribution, yet the dominant centre-based clustering approach -- K-means -- produces a more uniform distribution due to an inductive bias toward spherical clusters. In this paper we revisit graph-based clustering as a bottom-up alternative, where segment embeddings are connected by pairwise similarity and partitioned using the Leiden algorithm. We show that graph clustering substantially outperforms centre-based approaches (K-means, GMM, BIRCH) in both word- and syllable-level lexicon discovery across three languages, producing more Zipf-like distributions. Another bottom-up approach, agglomerative clustering with average linkage, also performs well, although it is computationally less efficient and allows for less control over the resulting distribution. Our work calls into question the dominance of centre-based clustering for term discovery, and promotes graph clustering as an attractive alternative.</description>
  <dc:source>Systems/eess.AS_(Audio_and_Speech_Processing)</dc:source>
</item>
<item>
  <title>Anchoring the Unknown: Open-Set Model Attribution via Proxy-Anchor Learning</title>
  <link>https://arxiv.org/abs/2606.10758</link>
  <pubDate>Wed, 10 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.10758v1 Announce Type: new Abstract: The proliferation of text-to-speech (TTS) systems capable of generating realistic synthetic speech poses growing challenges for audio forensics. While binary deepfake detection has received considerable attention, source tracing (i.e., identifying which TTS system produced a given audio sample) remains underexplored, particularly in open-set scenarios where unknown systems may be encountered. We propose a metric learning framework based on the Proxy-Anchor loss function that operates on Wav2Vec2-BERT embeddings to learn a discriminative embedding space for TTS source attribution and out-of-distribution (OOD) detection of unseen systems. We evaluate it on the MLAAD v9 dataset spanning 140 TTS systems across 51 languages, and introduce an architecture merging strategy that groups TTS system versions into unified classes, reducing inter-class confusion. Our system achieves 99.76% accuracy on 110 in-distribution classes and a False Positive Rate (FPR@95) as low as 2.04% for OOD detection. For a fair comparison against the current state of the art, we further evaluate it on the official MLAAD v5 dataset splits, where it almost doubles the OOD accuracy. These results demonstrate that Proxy-Anchor metric learning, combined with architecture-aware class design and post-hoc OOD scoring, provides an effective framework for forensic TTS source tracing in both closed-set and open-set settings.</description>
  <dc:source>Systems/eess.AS_(Audio_and_Speech_Processing)</dc:source>
</item>
<item>
  <title>Spatial-Omni: Spatial Audio Understanding Integration in Multimodal LLMs via FOA Encoding</title>
  <link>https://arxiv.org/abs/2606.10738</link>
  <pubDate>Wed, 10 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.10738v1 Announce Type: new Abstract: Recent multimodal large language models mainly process audio as monaural signals, thereby discarding the spatial cues contained in spatial audio for sound localization, spatial relation reasoning, and spatial scene understanding. We propose Spatial-Omni, a lightweight method that implements SO-Encoder to inject First-Order Ambisonics (FOA) spatial audio into existing Omni LLMs as an independent modality, without modifying their original audio encoders. SO-Encoder provides spatial tokens with limited additional context cost and improves spatial audio understanding through efficient staged training. To support training and evaluation, we construct SO-Dataset, SO-QA, and SO-Bench from open-source data, real recordings, and simulations, containing 400K FOA spatial audio clips and 2.1M spatial question answering pairs. SO-Bench covers 16 spatial audio understanding subtasks, including basic detection and location estimation, spatial relation understanding, and complex spatial reasoning. Experiments show that Spatial-Omni outperforms existing open-source Large Audio-Language Models (LALMs) and Omni LLM models on spatial audio understanding tasks while retaining a reasonable level of general audio understanding. Code and data are available at https://github.com/dieKarotte/Spatial-Omni.</description>
  <dc:source>Systems/eess.AS_(Audio_and_Speech_Processing)</dc:source>
</item>
<item>
  <title>GC-LoRA: Gated Convolutional LoRA for Parameter-Efficient Acoustic Adaptation</title>
  <link>https://arxiv.org/abs/2606.10464</link>
  <pubDate>Wed, 10 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.10464v1 Announce Type: new Abstract: Transformer-based Speech Foundation Models excel in most Automatic Speech Recognition tasks but often suffer performance degradation when applied to domains with mismatched acoustic characteristics. While Parameter Efficient Fine-Tuning (PEFT) methods, such as Low-Rank Adaptation (LoRA), adjust global attention, they lack the local context modeling crucial for capturing domain-specific variations. We propose GC-LoRA, a novel adapter architecture that injects Conformer-style local convolutional processing into pretrained Transformer encoders. By integrating a lightweight adapter to encoder attention output projections, our method efficiently captures local acoustic dependencies without disrupting pretrained global representations. Experiments across diverse datasets (acoustically-degraded, bandlimited, dialectal, child) demonstrate the efficacy of our approach, achieving Word Error Rate (WER) reductions of up to 10.9% compared to baselines while adding minimal trainable parameters.</description>
  <dc:source>Systems/eess.AS_(Audio_and_Speech_Processing)</dc:source>
</item>
<item>
  <title>Entropy-Aware Domain-Routed Mixture-of-Experts Speech-LLM Framework: A Case Study of Multi-Domain Child-Adult ASR</title>
  <link>https://arxiv.org/abs/2606.10454</link>
  <pubDate>Wed, 10 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.10454v1 Announce Type: new Abstract: While Speech Large Language Models (Speech-LLMs) have achieved strong performance on adult Automatic Speech Recognition (ASR), their effectiveness on child speech remains under-explored, and single models often struggle to handle diverse adult and child age groups simultaneously. This paper proposes a Mixture-of-Experts (MoE) Speech-LLM for unified ASR across adult and child speech spanning diverse environments and age groups. The framework employs a Classifier-based Domain Router (C-DR) with a coarse-to-fine strategy and integrates both a Mixture-of-Projectors (MoP) and a Mixture-of-LoRAs (MoL) to model domain-specific variations. To address routing uncertainty near domain boundaries, an Entropy-Aware Routing (EAR) mechanism is introduced to dynamically incorporate a shared expert. Experiments on public child corpora demonstrate consistent improvements over baselines while preserving adult ASR performance. To our knowledge, this is the first work leveraging Speech-LLMs for unified, multi-domain ASR encompassing both children and adults.</description>
  <dc:source>Systems/eess.AS_(Audio_and_Speech_Processing)</dc:source>
</item>
<item>
  <title>SSL-GMMVC: Interpretable Voice Conversion via Locally Linear GMM Transforms in Self-Supervised Representation Space</title>
  <link>https://arxiv.org/abs/2606.10317</link>
  <pubDate>Wed, 10 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.10317v1 Announce Type: new Abstract: We introduce SSL-GMMVC, an interpretable voice conversion method in self-supervised speech space. The method models paired source-target features with a Gaussian mixture model and performs conversion as a posterior-weighted sum of affine transforms. This yields locally linear transformations that adapt to heterogeneous feature-space structure while remaining analytically tractable. Through objective and subjective evaluations, we show that SSL-GMMVC improves speaker similarity with comparable intelligibility and naturalness, and that even a constrained covariance variant surpasses a deep learning baseline as the number of mixture components increases. Further analyses link component selection to phonetic structure and reveal interpretable scaling and rotation in the learned transforms. These findings highlight SSL-GMMVC as an effective, analyzable framework for voice conversion.</description>
  <dc:source>Systems/eess.AS_(Audio_and_Speech_Processing)</dc:source>
</item>
<item>
  <title>ANCHOR: Autoregressive Non-intrusive Chunk-Ordered Refinement for Joint Multi-Resolution Speech Quality Modeling</title>
  <link>https://arxiv.org/abs/2606.10233</link>
  <pubDate>Wed, 10 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.10233v1 Announce Type: new Abstract: While speech quality is typically assessed on complete utterances, streaming and generative systems require incremental estimation from partial audio. Existing predictors assume full context, degrading on prefix-constrained inputs. Extending ARECHO, we propose ANCHOR, reformulating incremental assessment as a multi-resolution autoregressive task. It models chunk- and utterance-level quality within a single decoder using dual-resolution tokens and a resolution-aware hierarchy for coarse-to-fine refinement. Experiments show substantial robustness under partial input, including a 48% PLCMOS error reduction on 2-second prefixes. Convergence analysis reveals a 4-6 s effective perceptual context horizon. A stress test further isolates structured extrapolation biases under localized corruption. Results demonstrate that hierarchical supervision improves incremental prediction and elucidates how perceptual quality accumulates over time.</description>
  <dc:source>Systems/eess.AS_(Audio_and_Speech_Processing)</dc:source>
</item>
<item>
  <title>LLM can Read Spectrogram: Encoder-free Speech-Language Modeling</title>
  <link>https://arxiv.org/abs/2606.10231</link>
  <pubDate>Wed, 10 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.10231v1 Announce Type: new Abstract: Recent speech-aware large language models (Speech-LLMs) rely on a pre-trained speech encoder to convert audio into semantic-rich representations consumable by LLM. In this work, instead, we explore: can an LLM learn to read Mel spectrogram directly without a dedicated speech encoder? We propose Mel-LLM, an encoder-free Speech-LLM that feeds lightly pre-processed Mel spectrogram patches directly into the LLM through a linear projection, allowing the LLM to learn speech-text alignment purely through its own parameters. We conduct extensive experiments on both automatic speech recognition (ASR) and text-to-speech (TTS) tasks. For ASR, we evaluate on the OpenASR leaderboard public sets and production-level scaling experiments, demonstrating that the encoder-free solution achieves competitive performance with only limited degradation compared to encoder-initialized counterparts. We find that when data is limited, initialization from a multimodal checkpoint (Phi-4-MM) is crucial for maintaining performance. We also present ablation studies revealing which LLM layers are less relevant to speech encoding. For TTS, we show preliminary results with a next-token VAE approach. While TTS performance is not yet optimal, these results establish the feasibility of a fully unified encoder-free architecture for autoregressive speech-text modeling.</description>
  <dc:source>Systems/eess.AS_(Audio_and_Speech_Processing)</dc:source>
</item>
<item>
  <title>Paediatric-HGNN: A Hybrid Heterogeneous Graph Neural Network for Detecting Disfluency in Children&#39;s Speech via Multiscale Acoustic Fusion</title>
  <link>https://arxiv.org/abs/2606.08210</link>
  <pubDate>Tue, 09 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.08210v1 Announce Type: new Abstract: Automated stuttering detection (ASD) systems struggle with paediatric speech due to high acoustic variability in developing voices and the subtle distinction between pathological stuttering and typical developmental disfluencies. We introduce Paediatric-HGNN, a framework using a Context-aware Part-whole Interaction Network (CaPIN) tailored for paediatric data. Instead of conventional 1D signal modelling, our approach builds a heterogeneous graph capturing hierarchical relationships between lexical units (word nodes) and fine-grained acoustic segments (frame nodes). Trained on curated paediatric corpora (UCLASS and FluencyBank), Paediatric-HGNN achieves 82.4% weighted accuracy and a Typical Disfluency F1-score of 0.386. Modelling hierarchical lexical-acoustic interactions captures developmental &quot;searching&quot; behaviour, offering a more robust and interpretable tool for early clinical intervention.</description>
  <dc:source>Systems/eess.AS_(Audio_and_Speech_Processing)</dc:source>
</item>
<item>
  <title>AeroSpectra Sentinel: An Auditable LLM Prompt-Chaining Decision-Support Workflow for Acute Asthma Risk Assessment from Respiratory Sounds and Clinical Signals</title>
  <link>https://arxiv.org/abs/2606.08247</link>
  <pubDate>Tue, 09 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.08247v1 Announce Type: new Abstract: Acute asthma risk assessment requires rapid interpretation of respiratory sounds, oxygenation, airflow limitation, speech ability, work of breathing, mental status, and response to reliever therapy. Conventional audio-only classifiers can detect wheeze-like patterns but often lack transparent clinical reasoning and safe escalation logic. This paper presents AeroSpectra Sentinel, a client-side research prototype and decision-support workflow that combines short-time Fourier transform (STFT) respiratory sound analysis, lightweight machine-learning screening, clinical feature fusion, and a five-stage large language model (LLM) prompt-chaining process. The workflow separates signal acquisition, preprocessing, acoustic feature extraction, ML screening, clinical guardrails, and FHIR-ready reporting. We evaluated the audio screening component on a public respiratory sound dataset containing 1,211 WAV recordings from five labels. Using a stratified subset of 584 recordings, a random forest achieved 91.10% binary accuracy and 78.69% F1-score for asthma-vs-non-asthma screening, while a feature-based multilayer perceptron achieved 89.73% accuracy and 78.26% F1-score. A compact log-spectrogram CNN achieved 73.29% accuracy and 55.17% F1-score. Multiclass classification achieved 77.40% accuracy and 77.23% macro-F1. To evaluate the LLM workflow, we conducted a scenario-based audit on 40 simulated clinical vignettes comparing one-shot prompting, prompt chaining, prompt chaining with guardrails, and prompt chaining with guardrails plus FHIR schema validation. The guardrail-plus-schema variant achieved the strongest simulated safety and documentation consistency. AeroSpectra Sentinel is intended as a research prototype, not as a diagnostic medical device or clinically validated risk-assessment product.</description>
  <dc:source>Systems/eess.AS_(Audio_and_Speech_Processing)</dc:source>
</item>
<item>
  <title>SMC-ITA: Sequential Monte Carlo Inference-Time Alignment for Video-to-Audio Generation</title>
  <link>https://arxiv.org/abs/2606.08393</link>
  <pubDate>Tue, 09 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.08393v1 Announce Type: new Abstract: Video-to-audio (V2A) generation must jointly satisfy audiovisual alignment, semantic consistency, temporal synchronization, and perceptual quality. While prior work has mainly focused on model architecture, multimodal conditioning, and training objectives, inference-time alignment for V2A remains underexplored. In this paper, we study inference-time alignment for flow-matching-based V2A generation and formulate it as a search problem. We propose Sequential Monte Carlo Inference-Time Alignment (SMC-ITA), which combines lookahead-based reward estimation and sequential Monte Carlo resampling to reallocate computation adaptively using multi-dimensional cross-modal rewards. SMC-ITA improves over naive single-trajectory sampling, achieving a 55.67% relative reduction in DeSync, a 20.23% improvement in IB-score, and a 15.44% improvement in Audio Quality. Under matched NFE budgets, it also achieves the best overall trade-off among the compared search baselines, outperforming Best-of-N and Beam Search. Ablation studies further show that lookahead improves the reliability of intermediate reward estimates and that systematic resampling is a strong practical default for V2A inference-time alignment.</description>
  <dc:source>Systems/eess.AS_(Audio_and_Speech_Processing)</dc:source>
</item>
<item>
  <title>Sound Field Interpolation Using Physics-Informed Extreme Learning Machine with Pre-Training</title>
  <link>https://arxiv.org/abs/2606.08435</link>
  <pubDate>Tue, 09 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.08435v1 Announce Type: new Abstract: Numerous machine learning-based sound field interpolation methods have been proposed. In particular, physics-informed neural networks (PINNs) can accurately interpolate sound fields from a small number of microphones. However, their high computational cost and long training time pose practical challenges for applications requiring real-time processing or online learning. To address this, we propose a hybrid framework that combines PINN-based pre-training with a physics-informed extreme learning machine (PIELM) tailored for acoustic fields. By replacing iterative PINN fine-tuning for each target sound field with closed-form output-layer adaptation using hidden-layer weights pre-trained by PINN, the proposed method efficiently interpolates unknown sound fields from limited observations. Simulation results under simplified one-dimensional free-field conditions demonstrate that, given a pre-trained model, the proposed method achieves interpolation accuracy comparable to that of PINN-based fine-tuning while reducing the adaptation time by more than three orders of magnitude.</description>
  <dc:source>Systems/eess.AS_(Audio_and_Speech_Processing)</dc:source>
</item>
<item>
  <title>Fast and Robust On-Device Speaker Diarization: Relative Minimum Cluster Size for Stride-Accelerated Pipelines</title>
  <link>https://arxiv.org/abs/2606.08505</link>
  <pubDate>Tue, 09 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.08505v1 Announce Type: new Abstract: Speech applications such as meeting transcription and voice agents would benefit from on-device speaker diarization, but practical adoption is limited by inference cost. We study how far a Pyannote 3.1-based pipeline can be accelerated on consumer hardware (an RTX 5070 Ti GPU and an Apple M4 laptop) while preserving diarization error rate (DER). A simple recipe (coarser segmentation stride and per-chunk embedding) yields multi-fold speedups and is DER-neutral on AMI, but degrades sharply on in-the-wild data: on VoxConverse, DER rises from 0.075 to 0.113. We trace the failure to speaker under-counting in the clustering stage, caused by a fixed minimum cluster size interacting with the reduced number of embeddings per speaker. We propose a relative minimum cluster size, mcs = round(f * n) with f = 0.01, which adapts to the embedding budget per recording. A single value of f recovers VoxConverse DER to 0.079 (about 89% of the lost accuracy) while keeping AMI flat, and the accelerated pipeline reaches up to 12.2x speedup on AMI (MPS) over our CAM++ baseline.</description>
  <dc:source>Systems/eess.AS_(Audio_and_Speech_Processing)</dc:source>
</item>
<item>
  <title>G-MaP-SE: Guided Speech Enhancement via GMM-Based Prior Matching</title>
  <link>https://arxiv.org/abs/2606.08580</link>
  <pubDate>Tue, 09 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.08580v1 Announce Type: new Abstract: Using speaker embeddings as conditioning can strengthen speech enhancement, but most methods either require clean enrollment audio or rely on embeddings extracted from noisy speech, which are fragile under noise and domain shift. We propose G-MaP-SE, a guided enhancement framework that builds a clean-speech embedding prior with a Gaussian Mixture Model (GMM) and refines a noisy conditioning embedding by matching it to this prior. The matched prior embedding is then injected into a time-frequency enhancement backbone via a lightweight gated fusion module. Experiments on VoiceBank+DEMAND and DNS Challenge 2020 datasets show that the proposed prior matching consistently outperforms noisy conditioning and substantially narrows the gap to an oracle clean-conditioning upper bound, while requiring no enrollment audio at inference time. The code, audio samples, and checkpoint are available.</description>
  <dc:source>Systems/eess.AS_(Audio_and_Speech_Processing)</dc:source>
</item>
<item>
  <title>Few-shot Class-variable Incremental Audio Classification via Prototype Adaptation and Pseudo Class-variable Training</title>
  <link>https://arxiv.org/abs/2606.08898</link>
  <pubDate>Tue, 09 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.08898v1 Announce Type: new Abstract: In the task of few-shot class-incremental audio classification, the number of classes is assumed to always increase without considering the possibility of decrease. However, the number of classes generally increases or decreases in practice. In this paper, we investigate a problem of Few-shot Class-variable Incremental Audio Classification (FCIAC), in which the number of classes increases or decreases. We propose a FCIAC method using prototype adaptation and pseudo class-variable training. The model in our method consists of an encoder and a classifier. The classifier is initialized by a class-variable prototype adaptation network, whose structure dynamically changes with the change of classes. In addition, we design a pseudo class-variable training strategy to enhance the model&#39;s adaptability to changing classes. Experiments on three public datasets show that our method exceeds previous methods in average accuracy. The code is at: https://github.com/cgq2971-afk/FCIAC.</description>
  <dc:source>Systems/eess.AS_(Audio_and_Speech_Processing)</dc:source>
</item>
<item>
  <title>BareWave: Waveform-Native Flow-Matching Text-to-Speech</title>
  <link>https://arxiv.org/abs/2606.09048</link>
  <pubDate>Tue, 09 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.09048v1 Announce Type: new Abstract: Removing intermediate representations and separately trained decoding stages has become an important direction in generative modeling. In text-to-speech, however, high-quality systems are still commonly built through an intermediate acoustic representation before waveform synthesis. In this work, we present BareWave, a fully waveform-native framework for direct text-to-wave generation in flow-matching TTS. We consider this setting to raise three training challenges: raw-waveform modeling lacks a strong pretrained representational scaffold, different stages of training benefit from different noise schedules, and data-space perceptual objectives do not automatically share the temporal structure of the velocity-space flow objective. As a result, direct waveform training is hard to optimize efficiently, hard to push toward a strong final operating point with a fixed recipe, and hard to integrate effective perceptual refinement. Guided by this view, we develop a direct text-to-wave training framework that combines training-time representation alignment, staged noise scheduling, and velocity-aware perceptual alignment (VAPA), while preserving a single waveform-native inference path without pretrained components at test time. Experiments on zero-shot voice cloning show that strong intelligibility, speaker similarity, and naturalness can be achieved under a fully waveform-native inference path, supporting waveform-native flow-matching TTS as a practical direction. Project page with audio demos is available at https://barewave.github.io/.</description>
  <dc:source>Systems/eess.AS_(Audio_and_Speech_Processing)</dc:source>
</item>
<item>
  <title>MeanVC 2: Robust Low-Latency Streaming Zero-Shot Voice Conversion</title>
  <link>https://arxiv.org/abs/2606.09050</link>
  <pubDate>Tue, 09 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.09050v1 Announce Type: new Abstract: Streaming zero-shot voice conversion (VC) has become increasingly popular due to its potential for real-time applications. The recently proposed MeanVC achieves lightweight streaming zero-shot VC, but it has several limitations: its chunk-wise autoregressive denoising doubles the effective training sequence length, conversion quality degrades under small-chunk settings, and its timbre encoder directly relies on reference mel-spectrograms, making it sensitive to reference audio quality. To address these limitations we propose MeanVC 2. We introduce future-receptive chunking (FRC), which explicitly schedules past and future receptive fields across diffusion transformer decoder layers and removes clean-chunk teacher forcing. By incorporating bounded future context, FRC enables stable conversion with a 40 ms chunk size. We further introduce a universal timbre token encoder, which constructs a timbre representation from a global speaker embedding and retrieves fine-grained timbre cues via cross-attention, improving robustness to low-quality references and enhancing zero-shot speaker similarity. Experimental results show that MeanVC 2 significantly outperforms MeanVC, while reducing latency from 211 ms to 110 ms. Audio samples are publicly available. The source code will be publicly released.</description>
  <dc:source>Systems/eess.AS_(Audio_and_Speech_Processing)</dc:source>
</item>
<item>
  <title>HoliDubber: Holistic Video Dubbing for Complex Acoustic Scenes via Text-Guided Audio Synthesis</title>
  <link>https://arxiv.org/abs/2606.09098</link>
  <pubDate>Tue, 09 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.09098v1 Announce Type: new Abstract: Video dubbing is a cornerstone of multimedia content creation, aiming to synthesize synchronized acoustic sequences for visual streams. While Text-to-Speech (TTS) and Text-to-Audio (TTA) generation have each achieved remarkable progress, existing dubbing systems remain confined to isolated speech synthesis without incorporating sound effects and ambient audio, forcing practitioners to rely on fragmented workflows and laborious manual post-mixing. To address this limitation, we present HoliDubber, a holistic video dubbing framework that moves beyond speech-only generation by enabling the joint synthesis of speech and sound effects from a single text prompt. Specifically, HoliDubber adopts a patch-based autoregressive diffusion transformer architecture, where a causal language model autoregressively models aggregated patch embeddings to capture global temporal structure, and a Diffusion Transformer decoder generates high-fidelity continuous tokens within each patch, following a divide-and-conquer strategy. To achieve cross-modal alignment, visual features are encoded into patch-level representations and fused with audio patches via cross-attention, enabling the model to ground speech generation in the speaker&#39;s visual articulation dynamics. In addition, we introduce HoliDub-Bench, a benchmark curated from established datasets with synchronized video-text-audio triplets designed for holistic dubbing evaluation. Extensive experiments demonstrate that HoliDubber significantly outperforms existing methods across multiple benchmarks in speech quality, synchronization, and speaker similarity. Furthermore, results on HoliDub-Bench validate the effectiveness of joint speech-and-sound generation, establishing a new paradigm for holistic video dubbing in complex acoustic scenes. The demo page of the project is https://holidubber.github.io</description>
  <dc:source>Systems/eess.AS_(Audio_and_Speech_Processing)</dc:source>
</item>
<item>
  <title>FlashTTS: Fast Streaming TTS with MTP Acceleration and X-pred Mean Flow Distillation</title>
  <link>https://arxiv.org/abs/2606.09141</link>
  <pubDate>Tue, 09 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.09141v1 Announce Type: new Abstract: Recent progress in speech dialogue systems requires Text-to-Speech (TTS) models to be faster and more responsive. Modern speech dialogue systems impose two primary requirements on TTS models: low latency and support for streaming inputs and outputs. However, most existing single-codebook LLM-based TTS methods rely on multi-stage pipelines that lack native streaming capabilities. These systems typically suffer from high end-to-end latency due to slow autoregressive prediction and multi-step flow matching. To address these limitations, we propose FlashTTS, an open-source and low-latency streaming TTS framework. FlashTTS introduces a lagged multi-track architecture that natively processes streaming text and speech inputs, thereby eliminating the need for sentence-level buffering. To accelerate acoustic generation, we integrate parallel Multi-Token Prediction (MTP) with an X-pred mean flow matching decoder. This configuration achieves high-fidelity token-to-mel generation in exactly two function evaluations (2-NFE). By jointly optimizing input processing and decoding efficiency, FlashTTS offers a practical foundation for real-time speech dialogue systems. Experiments show that FlashTTS substantially reduces First-Packet Latency to 325ms compared to robust streaming baselines, all while preserving strong zero-shot voice cloning and cross-lingual intelligibility. Speech samples are available. The model code and checkpoints will be released as open source.</description>
  <dc:source>Systems/eess.AS_(Audio_and_Speech_Processing)</dc:source>
</item>
<item>
  <title>A Comparative Study of Pre-trained Speech Encoders and Training Objectives for Large-Scale Indic Spoken Language Identification</title>
  <link>https://arxiv.org/abs/2606.09317</link>
  <pubDate>Tue, 09 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.09317v1 Announce Type: new Abstract: Spoken language identification (LID) for Indian languages is a challenging problem due to the large number of languages, significant phonetic overlap among related varieties, and the scarcity of labeled data for many low-resource languages. In this work, we present a systematic comparative study of two pre-trained speech encoders -- Whisper and FastConformer -- combined with a linear classifier for large-scale Indic LID spanning 42 languages across four linguistic families. We evaluate both encoders in frozen (linear probing) and fine-tuned settings, and compare three training objectives: cross-entropy (CE), supervised contrastive loss with cross-entropy (CE+SupCon), and hierarchical softmax (HSM). Models are trained on the Vaani dataset and evaluated in a cross-corpus setting on Vaani-Test (held-out), FLEURS, and Kathbath, providing insights into domain generalization. The frozen FastConformer encoder achieves over 90% macro accuracy on FLEURS and Kathbath without any task-specific adaptation, substantially outperforming Whisper on out-of-domain benchmarks, while fine-tuned Whisper yields stronger in-domain performance. HSM consistently outperforms CE and CE+SupCon for both encoders across all benchmarks, with the largest gains on out-of-domain test sets. CE+SupCon degrades FastConformer&#39;s cross-corpus generalization, suggesting that the contrastive objective over-specializes representations to in-domain conditions. Per-family analysis shows that Central Indo-Aryan varieties are the hardest to discriminate, with Hindi--Urdu and the Sadri--Chhattisgarhi--Surgujia cluster being the dominant confusion pairs.</description>
  <dc:source>Systems/eess.AS_(Audio_and_Speech_Processing)</dc:source>
</item>
<item>
  <title>Factors affecting ASR performance: A study using state of the art ASR models in Indic Languages</title>
  <link>https://arxiv.org/abs/2606.09335</link>
  <pubDate>Tue, 09 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.09335v1 Announce Type: new Abstract: ASR performance varies across languages, speakers, and recording conditions, yet systematic analysis for Indic languages remains limited. We present a large-scale study of decoded outputs from multiple open-source ASR models evaluated on diverse Indian speech datasets in zero-shot settings. We analyze linguistic, speaker-level, and acoustic factors across Hindi, Bengali, Kannada, Telugu, and Marathi. We examine correlations between WER and speaker traits such as average word length, speaking rate, and utterance duration across multiple model-dataset pairs. For Hindi, we further analyze audio factors including telephone codecs, bit depth, resampling, and background noise. Results reveal both cross-lingual patterns and language-specific sensitivities, showing how speaker behavior and signal processing choices affect ASR robustness in real-world Indic scenarios.</description>
  <dc:source>Systems/eess.AS_(Audio_and_Speech_Processing)</dc:source>
</item>
<item>
  <title>Parameter-Efficient Continual Learning for Automatic Speech Recognition</title>
  <link>https://arxiv.org/abs/2606.09342</link>
  <pubDate>Tue, 09 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.09342v1 Announce Type: new Abstract: Speech foundation models enable strong general-purpose ASR and are attractive for downstream adaptation. However, their size and the catastrophic forgetting induced by sequential fine-tuning demand parameter-efficient and regularized training methods, motivating parameter-efficient continual learning (PECL). While PECL has been widely studied in NLP and vision, it has received less attention in ASR. In this paper, we propose a simple yet effective PECL method based on recent advances in parameter-efficient fine-tuning for ASR. We partition pretrained weight matrices into head and tail subspaces according to singular values and restrict adaptation to approximate rotations within the low-energy tail subspace, preserving dominant components and reducing forgetting. For subsequent tasks, rotations are combined via weight averaging to further improve retention. Experiments on two benchmarks demonstrate reduced forgetting and superior overall performance compared to recent PECL baselines.</description>
  <dc:source>Systems/eess.AS_(Audio_and_Speech_Processing)</dc:source>
</item>
<item>
  <title>A study on the impact of region specific data on the performance of Indic ASR</title>
  <link>https://arxiv.org/abs/2606.09345</link>
  <pubDate>Tue, 09 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.09345v1 Announce Type: new Abstract: Automatic Speech Recognition (ASR) systems are widely deployed across linguistically diverse regions, yet their ability to generalize across fine-grained geographic variation remains underexplored. We present a systematic study of cross-district ASR generalization for Indian languages, analyzing the impact of regional variation on performance. Using fine-tuning as a controlled probe, we train models on speech from a single district and evaluate them on other districts within the same language. We examine trends across multiple train-test district pairs and quantify performance differences. To assess geographic effects, we analyze the correlation between WER and inter-district distance using two distance measures. Our results show consistent correlations between geographic distance and WER, highlighting the challenges of regional generalization and the need for geographically diverse speech data in ASR development and evaluation in India.</description>
  <dc:source>Systems/eess.AS_(Audio_and_Speech_Processing)</dc:source>
</item>
<item>
  <title>Rethinking Depth: A study of the Recursive-Transformer for Speech Recognition</title>
  <link>https://arxiv.org/abs/2606.09357</link>
  <pubDate>Tue, 09 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.09357v1 Announce Type: new Abstract: Transformer-based architectures have led to significant improvements in Automatic Speech Recognition (ASR), often at the cost of substantially increased model sizes. A promising approach to address this issue is layer sharing through depth recursion, commonly referred to as the Recursive-Transformer, which involves repeatedly applying the same layers within the model. Despite its potential shown in other fields, this technique remains relatively unexplored in ASR. In this paper, we present an experimental study of the Recursive-Transformer applied to ASR encoder architectures. We systematically investigate the impact of recursion depth and layer allocation within the Recursive-based Transformer. Our results demonstrate that the Recursive-Transformer is a viable alternative, especially when recurrence is applied in the latent space with a restricted number of loops, obtaining comparable performance while reducing the parameter count by 66%.</description>
  <dc:source>Systems/eess.AS_(Audio_and_Speech_Processing)</dc:source>
</item>
<item>
  <title>Your U-Net Dereverberation Model is Secretly an RIR Encoder</title>
  <link>https://arxiv.org/abs/2606.09557</link>
  <pubDate>Tue, 09 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.09557v1 Announce Type: new Abstract: In this work, we analyze the ability of NCSN++ U-Net based audio dereverberation models to capture global room characteristics in their intermediate representations. Through an empirical study of both a state-of-the-art diffusion-based model and a discriminative counterpart, we show that deeper layers encode structured room impulse response (RIR)-dependent embeddings. Moreover, the discriminative ability of this implicit room representation correlates with dereverberation performance across objective metrics. Motivated by this observation, we propose a training strategy that explicitly conditions the network on pre-trained RIR embeddings, obtained via self-supervised contrastive learning. Incorporating RIR conditioning improves representation quality, accelerates convergence, and enhances dereverberation performance, while significantly reducing the number of reverse diffusion steps required by the diffusion-based model during inference.</description>
  <dc:source>Systems/eess.AS_(Audio_and_Speech_Processing)</dc:source>
</item>
<item>
  <title>Cross-Modal Masking for Robust Silent Speech Synthesis Using sEMG and Lipreading</title>
  <link>https://arxiv.org/abs/2606.09667</link>
  <pubDate>Tue, 09 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.09667v1 Announce Type: new Abstract: Speech restoration through silent speech interfaces (SSIs) has emerged as a promising assistive technology for individuals with impaired or absent laryngeal voice production. Among non-invasive SSI modalities, surface electromyography (sEMG) and video-based lipreading provide complementary articulatory information, yet their integration for continuous speech synthesis remains underexplored. Moreover, existing multimodal approaches rarely address robustness to modality degradation or temporary sensor failure, limiting their applicability in realistic scenarios. In this work, we propose a masked multimodal speech synthesis framework that jointly leverages sEMG and lipreading signals through modality masking during training. Under multispeaker settings, the proposed approach reduces word error rate by up to 14 absolute percentage points compared to the strongest unimodal baseline. Experimental results not only show that masking strategies are critical for these performance gains and robustness under low-bitrate conditions, but also that they generalize better than degradation-specific data augmentations in the presence of modality absence conditions. Phone-level analyses further reveal complementary contributions across modalities, with particularly strong benefits for vowels and for specific consonant groups. Overall, these findings demonstrate the effectiveness and robustness of masked multimodal integration for silent speech synthesis, although adaptation to laryngectomized speakers remains an open research challenge.</description>
  <dc:source>Systems/eess.AS_(Audio_and_Speech_Processing)</dc:source>
</item>
<item>
  <title>MeCo: One-Step MeanFlow-based Corrector for Multi-Channel Speech Separation</title>
  <link>https://arxiv.org/abs/2606.09677</link>
  <pubDate>Tue, 09 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.09677v1 Announce Type: new Abstract: While discriminative models for multi-channel speech separation excel in reference-based metrics, they often exhibit suboptimal human listening quality. To address this, we propose a novel MeanFlow-based one-step generative corrector (MeCo). MeCo learns a conditional average velocity field to map discriminative estimates directly onto the clean speech manifold in a single step. To maximize one-step generation performance, we introduce Data-Space Optimization (DSO). DSO integrates an $\mathbf{x}_r$-loss, which penalizes prediction errors on longer displacement intervals to serve as a generative objective for human listening quality, with an Endpoint SI-SDR loss that directly optimizes terminal signal fidelity. Experiments demonstrate that MeCo achieves state-of-the-art (SOTA) performance with minimal computational overhead, simultaneously achieving superior signal fidelity and human listening quality in both in-domain and out-of-domain scenarios.</description>
  <dc:source>Systems/eess.AS_(Audio_and_Speech_Processing)</dc:source>
</item>
<item>
  <title>AVI-Bench: Toward Human-like Audio-Visual Intelligence of Omni-MLLMs</title>
  <link>https://arxiv.org/abs/2606.07643</link>
  <pubDate>Tue, 09 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.07643v1 Announce Type: cross Abstract: Recent advances in Omni-Multimodal Large Language Models (Omni-MLLMs) have enabled strong integration of vision, audio, and language. However, their audio-visual intelligence (AVI) remains insufficiently evaluated due to the lack of systematic and comprehensive benchmarks. We introduce AVI-Bench, a cognitively inspired benchmark that evaluates Omni-MLLMs across three stages, perception, understanding, and reasoning, through cross-modal tasks requiring joint audio-visual interpretation. This design enables fine-grained diagnosis of model capabilities and failure modes. To further assess robustness beyond familiar domains, we propose AVI-Bench-PriSe, an extension that probes models&#39; primitive audio-visual sensation using unfamiliar, low-semantic stimuli, testing generalization beyond common training distributions. Extensive experiments on both open-source and closed-source models reveal substantial limitations in current Omni-MLLMs. Based on these findings, we present a four-level AVI taxonomy. Overall, AVI-Bench provides a principled evaluation framework to guide the development of more robust and generalizable AVI. Project website: https://fudancvl.github.io/AVI-Bench/</description>
  <dc:source>Systems/eess.AS_(Audio_and_Speech_Processing)</dc:source>
</item>
<item>
  <title>Multi-planar 2D-U-Net Segmentation of 3D-CT Abdominal Organs augmented by Spatial Occurrence Maps</title>
  <link>https://arxiv.org/abs/2606.07717</link>
  <pubDate>Tue, 09 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.07717v1 Announce Type: new Abstract: This work proposes a lightweight 2D-U-Net-based framework for segmenting five abdominal organs in large field-of-view 3D CT scans. The method combines coarse-to-fine segmentation, predictions from multiple anatomical planes, and additional fuzzy 3D spatial maps that provide anatomical location cues to improve segmentation accuracy. We combine multi-planar 2D-U-Net models augmented by a spatial occurrence map. The approach involves two main stages. First, the abdominal volume of interest region is detected by traversing the whole scan axially with a 2D-U-Net and determining the x-y-z-minimum and -maximum extents of the 5 abdominal organs of interest. Second, we use spatial occurrence maps to enhance our multi-planar 2D-U-net architecture inside the bounds from the former stage. The method is evaluated on 80 CT scans from various public sources. The results show Dice improvements of about 4% at maximum compared to the same model trained without spatial occurrence maps.</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>Programmable Silicon Retina on Pixel Processor Array</title>
  <link>https://arxiv.org/abs/2606.08370</link>
  <pubDate>Tue, 09 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.08370v1 Announce Type: new Abstract: Standard dynamic vision sensors approximate retinal processing by detecting temporal contrast changes, offering high speed and high dynamic range. In this work, we explore whether incorporating additional biologically inspired processing stages - specifically spatial filtering and gain control - can offer advantages for certain downstream tasks such as saliency prediction. We present the first implementation of a multi-stage Silicon Retina model on the SCAMP-5 Pixel Processor Array, along with a GPU-based simulation framework. We evaluate the performance of our model on Video Intensity Reconstruction and Video Saliency Prediction. While the bio-inspired model is less effective at reconstructing absolute intensity frames, it achieves a 13% reduction in saliency prediction loss in comparison to standard DVS event representation, while reducing the event rate by approximately 47%. These experiments use a lightweight FireNet-style network with approximately 100k parameters, adapted from event-based reconstruction to saliency prediction. These results suggest that the silicon retina&#39;s &quot;information distillation&quot; mechanism can achieve a more efficient representation for downstream neural networks, particularly in bandwidth-constrained edge applications.</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>X-Palm: Paired Multispectral-to-Smartphone Dataset for Cross-Domain Palmprint Authentication</title>
  <link>https://arxiv.org/abs/2606.08437</link>
  <pubDate>Tue, 09 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.08437v1 Announce Type: new Abstract: Palmprint modality offers a privacy-preserving biometric solution, yet its deployment is hindered by the domain gap between controlled enrollment and unconstrained authentication. Existing datasets are largely restricted to controlled setups and fail to capture the compound variability of real-world environments. In this paper, we introduce X-Palm, a cross-domain dataset comprising 6,006 palm images from 103 individuals (206 hands). To the best of our knowledge, X-Palm is the first palmprint dataset providing novel paired-identity acquisition specifically designed to bridge the gap between reliably controlled multispectral enrollment and unconstrained mobile authentication while encompassing a broad spectrum of in-the-wild variability. Unlike existing datasets that focus on single to a few variations, X-Palm addresses the massive modality and environmental shifts encountered in practical deployments by capturing paired data for identities across two distinct domains: (1) a controlled Multispectral Palmprint setting using our custom-developed scanner, and (2) an unconstrained smartphone palmprint setting that is participant-driven, incorporating simultaneous variations in hardware, hand pose, illumination, background, camera-to-hand distance, perspective, and palm surface conditions (e.g., moisture and occlusions). Our extensive benchmarks of 12 SOTA models reveal that while existing methods achieve high performance on controlled data, they experience severe performance collapse on X-Palm. Conversely, models trained on X-Palm demonstrate consistent robustness across domains, positioning X-Palm as a valuable resource for training a model towards real-world, cross-domain generalization. Data access instructions and the related benchmarking codes are publicly available at: https://github.com/X-Palm/X-Palm-2026</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>Active Source-free Domain Adaptation in Open-set Medical Image Segmentation via Decomposed Uncertainty and Prototype Discrepancy</title>
  <link>https://arxiv.org/abs/2606.08749</link>
  <pubDate>Tue, 09 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.08749v1 Announce Type: new Abstract: Deep learning (DL) methods are challenged to demonstrate robust performance across different segmentation datasets due to domain shifts, but active domain adaptation techniques enhance their generalization performance by querying a few samples from target domains for adaptation training. However, in clinical practice, target domains often include private classes of new anatomical structures or pathologies that are not present in the source data, and existing methods implement closed-set segmentation where source and target domains have the same segmentation classes. Additionally, source data are often inaccessible during adaptation due to strict data privacy regulations. To address these limitations, we propose an Active Source-free Open-set Domain Adaptation (ASFOSDA) method, which is the first work to implement active learning for adapting DL models in open-set medical image segmentation without access to source data. This method employs an active open-set query strategy to select the most informative target samples for training models based on Class-aware Decomposed Uncertainty (CDU) and Class-agnostic Prototype Discrepancy (CPD). CDU measures sample aleatoric uncertainty and model epistemic uncertainty by employing test-time augmentation in stochastic processes. CPD measures cross-domain and self-domain discrepancy for selecting diverse samples. Subsequently, to boost the adaptation performance by enhancing training samples, a Target-refined Self-training strategy is proposed to generate high-quality pseudo labels for unselected samples, thus combining them with labeled samples for semi-supervised training. We evaluated our method on cross-domain open-set volumetric medical image segmentation tasks, and it outperformed state-of-the-art adaptation methods.</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>Dynamic XR Rendering Offloading Based on Feature-Based Quality Assessment</title>
  <link>https://arxiv.org/abs/2606.09330</link>
  <pubDate>Tue, 09 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.09330v1 Announce Type: new Abstract: Extended Reality (XR) applications demand intensive computation and low latency, especially for real-time rendering tasks. In this letter, we present an edge-aided XR rendering testbed that dynamically offloads rendering workloads between the XR client and the edge server built upon network conditions and latency constraints. The testbed integrates a Microsoft HoloLens 2 headset, a GPU-enabled edge server, and a customized remote rendering toolkit based on the HOLO Stream SDK, enabling seamless switching between local and edge rendering modes in real time. To overcome the limitations of pixel-level quality metrics under head movements and asynchronous frame arrivals, we propose a perceptual evaluation metric based on deep feature embeddings and cosine similarity, which remains robust to spatial and temporal misalignments. Furthermore, we design a contextual bandit learning controller to adapt rendering placement decisions in real time by jointly optimizing perceptual quality and latency. Experimental results demonstrate the feasibility and performance of our testbed, validating its effectiveness in delivering high-quality and interactive XR experiences.</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>Vendor-agnostic 4D Phase Contrast MRI: a complete open-source pipeline for velocities, displacement, and strain analysis</title>
  <link>https://arxiv.org/abs/2606.09444</link>
  <pubDate>Tue, 09 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.09444v1 Announce Type: new Abstract: Phase contrast MRI (PC MRI) enables quantitative assessment of tissue motion and strain. Although it is increasingly used, standardized, vendor-agnostic pipelines for accelerated acquisitions remain scarce. We present a fully open-source 4D flow PC-MRI pipeline integrating a compressed sensing-accelerated sequence implemented in PyPulseq, BART-based reconstruction, and strain analysis. Additionally, a gradient probing sequence was developed to ensure correct velocity sign assignment across scanner orientations and vendors. The pipeline was validated across two Siemens MRI systems (3T MAGNETOM Prisma and 3T Vida Fit) in two anatomical applications: forearm (Flexor Digitorum Superficialis, n=9) and thigh (Vastus Lateralis, n=10) during Neuromuscular Electrical Stimulation (NMES)-induced contractions. Compressed sensing reduced acquisition times from 35 and 80 minutes to 5 and 11 minutes for the arm and leg acquisitions, respectively. Muscle strain maps and sigmoid-fitted strain curves enabled extraction of peak strain, mean strain, and buildup rate. Strains in the Vastus Lateralis were approximately one order of magnitude higher than in the Flexor Digitorum Superficialis (median peak strain 0.49 vs. 0.063, mean strain 0.31 vs. 0.031). The pipeline demonstrates multi-platform compatibility and provides a reproducible, open framework for quantitative muscle imaging.</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>Real-Time Industrial Defect Detection on Edge Hardware Using Fine-Tuned YOLOv8: A Systematic Benchmark on the NEU Surface Defect Database and MVTec AD with Automotive &amp; Battery Manufacturing Extensions</title>
  <link>https://arxiv.org/abs/2606.07659</link>
  <pubDate>Tue, 09 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.07659v1 Announce Type: cross Abstract: Automated surface defect detection is critical for ensuring rigorous quality control in high-speed manufacturing environments. While deep learning models offer remarkable accuracy, deploying them on resource-constrained edge hardware without introducing significant latency remains a persistent challenge. This paper presents Industrial-YOLO, an edge-optimized framework built upon a fine-tuned YOLOv8 architecture specifically engineered for real-time industrial defect detection. We conduct a systematic benchmark utilizing the NEU surface defect database for steel sheets and the MVTec AD dataset, supplemented with custom automotive manufacturing extensions representing real-world structural anomalies (scratches, pits, and inclusions). To bridge the gap between algorithmic complexity and edge hardware constraints, target-specific optimizations are introduced via TensorRT and OpenVINO acceleration engines. Experimental results demonstrate that Industrial-YOLO achieves a high-velocity inference speed exceeding 120 FPS on the NVIDIA Jetson Orin platform while maintaining an exceptional mean Average Precision (mAP) of 98.5%. The proposed framework showcases highly robust, zero-latency performance when deployed directly onto an active automotive assembly line, offering a scalable blueprint for next-generation automated optical inspection (AOI) systems.</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>LEGS: Laplacian-Enhanced Gaussian Splatting with a Nonlinear Weighted Loss</title>
  <link>https://arxiv.org/abs/2606.07932</link>
  <pubDate>Tue, 09 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.07932v1 Announce Type: cross Abstract: 3D Gaussian Splatting (3DGS) has become an efficient explicit representation for radiance field reconstruction and real-time novel view synthesis. However, its standard photometric loss treats flat and structure-rich regions similarly, which may limit the recovery of sharp contours and fine details. Edge-Guided Gaussian Splatting (EGGS) improves structure awareness through edge-guided weighting, but mainly relies on first-order gradient responses and linear weighting. In this paper, we propose LEGS, a Laplacian-Enhanced Gaussian Splatting method with a nonlinearly weighted loss. LEGS replaces first-order gradient guidance with second-order Laplacian structural guidance and maps the normalized Laplacian response into pixel-wise weights through nonlinear response-to-weight functions. The proposed loss improves structure-aware Gaussian optimization while keeping the original 3DGS rendering pipeline unchanged. Experiments on the full Tanks&amp;Temples and Mip-NeRF360 datasets show that LEGS improves peak signal-to-noise ratio (PSNR) by up to 1.68 dB over 3DGS and up to 0.52 dB over EGGS. Incorporating the proposed second-order nonlinear weighting strategy into FastGS and FasterGS further improves PSNR by up to 1.69 dB, demonstrating its effectiveness as a general loss-level extension for Gaussian Splatting pipelines with potential applications in AR/VR, immersive visualization, and real-time 3D content generation.</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>DAL-PCQA: Enabling Distortion-Level and Language-Driven Reasoning for Point Cloud Quality Assessment</title>
  <link>https://arxiv.org/abs/2606.07938</link>
  <pubDate>Tue, 09 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.07938v1 Announce Type: cross Abstract: Point Cloud Quality Assessment (PCQA) methods typically predict scalar Mean Opinion Scores (MOS), which quantify overall perceptual degradation but do not reveal its causes. In contrast, human observers naturally reason in terms of specific distortions such as blur, color shifts, point density changes, missing regions, and geometric deformations. To close this gap, we introduce DAL-PCQA, a distortion-aware, language-annotated dataset for PCQA. DAL-PCQA augments benchmark point clouds with multi-level distortion severity labels, discrete quality categories, and structured natural language descriptions aligned with human perception. We define a point-cloud-specific distortion taxonomy that covers both photometric and geometric artifacts. Statistical analysis reveals characteristic degradation patterns across distortion types and quality levels. To assess the utility of these annotations, we compare zero-shot and fine-tuned multimodal models for generating perceptual quality descriptions. Experiments show that distortion-aware supervision substantially improves lexical and semantic alignment with ground-truth descriptions. By enabling interpretable, distortion-level reasoning, DAL-PCQA facilitates language-driven, explainable point cloud quality assessment. The dataset is publicly available at https://github.com/swarna96/DAL-PCQA.</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>Feasibility to detect rapid change and disappearance of seagrass: Lessons from nearly 80 years of vegetation change in the Ako, Seto Inland Sea, Japan</title>
  <link>https://arxiv.org/abs/2606.07949</link>
  <pubDate>Tue, 09 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.07949v1 Announce Type: cross Abstract: This study analyses the Ako tidal flat in the Seto Inland Sea, Japan, where nearly all Zostera marina disappeared within a single year in 2025. Using aerial photographs from the 1940s onward, high-resolution satellite imagery, GRUS images (2.5-5 m), and monthly Sentinel-2 composites (10 m), we reconstructed approximately 80 years of seagrass distribution. YOLO-based segmentation using deep learning achieved high accuracy (overall accuracy &gt;= 0.9) across these datasets; although species could not be discriminated, the models captured the major temporal dynamics in vegetation area. The long-term mean seagrass area was 6.8 ha, but values fluctuated widely, from 3.5 ha in 1974 to 41.3 ha in 1989, apart from 0.2 ha in 2025. Sentinel-2 composites from 2019 to 2026 revealed clear seasonality, with vegetation increasing in early summer and declining from autumn. In 2025, however, the area decreased sharply after summer and remained anomalously low throughout the winter of 2025-2026. Our results indicate that the 2025 event was not a normal fluctuation but a rapid ecosystem shift involving the loss of the dominant canopy-forming species, most plausibly driven by regionally elevated summer water temperatures. The findings also have implications for seagrass Essential Ocean Variables (EOVs) and the State of Nature (SoN) metrics used in TNFD-aligned nature-related disclosures. Unlike forests, seagrass meadows require finer temporal resolution because both pronounced seasonality and abrupt collapse strongly influence area-based indicators. Therefore, in addition to previously noted issues such as species-level classification accuracy, we recommend that (1) baselines be defined over the longest available record and justified ecologically, (2) seasonal standardization be applied before inter-annual comparisons, and (3) years with extreme area anomalies be flagged rather than used as reference points.</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>SPIRONet: Spatial-Frequency Learning and Graph-based Channel Interaction Network for Vessel Segmentation</title>
  <link>https://arxiv.org/abs/2406.19749</link>
  <pubDate>Tue, 09 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2406.19749v2 Announce Type: replace Abstract: Automatic vessel segmentation plays a pivotal role in the development of next-generation interventional navigation systems for surgical robotics. However, current approaches still suffer from suboptimal segmentation performance under challenging intraoperative conditions, such as low signal-to-noise ratio (SNR), small or slender vessels, and strong interference. In this study, a novel spatial-frequency learning and graph-based channel interaction network (SPIRONet) is proposed to address the above issues. To address low-SNR vessel appearance and small or slender branches, dual spatial-frequency encoders are utilized, where the frequency encoder captures global vessel continuity that is less affected by local noise fluctuations, while the spatial encoder preserves fine vessel details. A cross-attention fusion module is further introduced to adaptively integrate this complementary spatial and frequency information. Moreover, to suppress interference from non-target vessels and vessel-like structures, a graph-based channel interaction module is designed to model channel-wise correlations, enhancing consistent vessel-related responses while suppressing task-irrelevant activations. Extensive experimental results on five challenging datasets demonstrate that the proposed method achieves competitive and consistently strong performance compared with existing methods. For example, SPIRONet achieves IoU improvements of +0.87%, +0.52%, +0.23%, +1.39%, and +2.22% over the strongest competing methods on CADSA, CAXF, DCA1, XCAD, and ARCADE, respectively. Moreover, SPIRONet achieves an inference speed of 21 FPS with a 512x512 input size, meeting the real-time requirements of interventional scenarios (6-12 FPS). These promising results indicate SPIRONet&#39;s potential for integration into interventional navigation systems. Code is available at https://github.com/Dxhuang-CASIA/SPIRONet.</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>A generalizable 3D framework and model for self-supervised learning in medical imaging</title>
  <link>https://arxiv.org/abs/2501.11755</link>
  <pubDate>Tue, 09 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2501.11755v2 Announce Type: replace Abstract: Current self-supervised learning methods for 3D medical imaging rely on simple pretext formulations and organ- or modality-specific datasets, limiting their generalizability and scalability. We present 3DINO, a cutting-edge SSL method adapted to 3D datasets, and use it to pretrain 3DINO-ViT: a general-purpose medical imaging model, on an exceptionally large, multimodal, and multi-organ dataset of ~100,000 3D medical imaging scans from over 10 organs. We validate 3DINO-ViT using extensive experiments on numerous medical imaging segmentation and classification tasks. Our results demonstrate that 3DINO-ViT generalizes across modalities and organs, including out-of-distribution tasks and datasets, outperforming state-of-the-art methods on the majority of evaluation metrics and labeled dataset sizes. Our 3DINO framework and 3DINO-ViT will be made available to enable research on 3D foundation models or further finetuning for a wide range of medical imaging applications.</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>SAGE: Shape-Adapting Gated Experts for Adaptive Histopathology Image Segmentation</title>
  <link>https://arxiv.org/abs/2511.18493</link>
  <pubDate>Tue, 09 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2511.18493v4 Announce Type: replace Abstract: The significant variability in cell size and shape continues to pose a major obstacle in computer-assisted cancer detection on gigapixel Whole Slide Images (WSIs), due to cellular heterogeneity. Current CNN-Transformer hybrids use static computation graphs with fixed routing. This leads to extra computation and makes it harder to adapt to changes in input. We propose Shape-Adapting Gated Experts (SAGE), an input-adaptive framework that enables dynamic expert routing in heterogeneous visual networks. SAGE reconfigures static backbones into dynamically routed expert architectures via a dual-path design with hierarchical gating and a Shape-Adapting Hub (SA-Hub) that harmonizes feature representations across convolutional and transformer modules. Embodied as SAGE with ConvNeXt and Vision Transformer UNet (SAGE-ConvNeXt+ViT-UNet), our model achieves a Dice score of 95.23% on EBHI, DSC scores of 92.78% and 91.42% on GlaS Test A and Test B, respectively, and 91.26% DSC at the WSI level on DigestPath, while exhibiting robust generalization under distribution shifts by adaptively balancing local refinement and global context. SAGE establishes a scalable foundation for dynamic expert routing in visual networks, thereby facilitating flexible visual reasoning. Project page: https://oxyzgiahuy.github.io/sage/</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>Solving Inverse Problems with Flow-based Models via Model Predictive Control</title>
  <link>https://arxiv.org/abs/2601.23231</link>
  <pubDate>Tue, 09 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2601.23231v2 Announce Type: replace Abstract: Flow-based generative models provide strong unconditional priors for inverse problems, but guiding their dynamics for conditional generation remains challenging. Recent work casts training-free conditional generation in flow models as an optimal control problem; however, solving the resulting trajectory optimisation is computationally and memory intensive, requiring differentiation through the flow dynamics or adjoint solves. We propose MPC-Flow, a model predictive control framework that formulates inverse problem solving with flow-based generative models as a sequence of control sub-problems, enabling practical optimal control-based guidance at inference time. We provide theoretical analysis linking MPC-Flow to the underlying optimal control objective and show how different algorithmic choices yield a spectrum of guidance algorithms, including regimes that avoid backpropagation through the generative model trajectory. We evaluate MPC-Flow on benchmark image restoration tasks, spanning linear and non-linear settings such as in-painting, deblurring, and super-resolution, and demonstrate strong performance and scalability to massive state-of-the-art architectures via training-free guidance of FLUX.2 (32B) in a quantised setting on consumer hardware.</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>ForcingDAS: Unified and Robust Data Assimilation via Diffusion Forcing</title>
  <link>https://arxiv.org/abs/2605.14285</link>
  <pubDate>Tue, 09 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2605.14285v2 Announce Type: replace Abstract: Data assimilation (DA) estimates the state of an evolving dynamical system from noisy, partial observations, and is widely used in scientific simulation as well as weather and climate science. In practice, filtering methods rely on frame-to-frame transition models. However, these models are fragile when observations are non-Markovian (when they form only a partial slice of a higher-dimensional latent state as in real-world weather data): they tend to accumulate errors over long horizons. At the same time, learned DA methods typically commit to a single regime, either filtering (nowcasting, real-time forecasting) or smoothing (retrospective reanalysis), which splits what should be a shared prior across application-specific pipelines. To address both issues, we introduce ForcingDAS, a unified and robust DA framework. Built on Diffusion Forcing with an independent noise level assigned to each frame, ForcingDAS learns a joint-trajectory prior instead of frame-to-frame transitions. This allows it to capture long-horizon temporal dependencies and reduce error accumulation. In addition, the same trained model spans the full filtering to smoothing spectrum at inference time. Specifically, nowcasting, fixed-lag smoothing, and batch reanalysis are selected through the inference schedule alone, without retraining. We evaluate ForcingDAS on 2D Navier-Stokes vorticity, precipitation nowcasting, and global atmospheric state estimation. Across all settings, a single model is competitive with or outperforms both learned and classical baselines that are specialized for individual regimes, with the largest gains observed on real-world weather benchmarks.</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>Embedded Graph Convolutional Networks for Real-Time Event Data Processing on SoC FPGAs</title>
  <link>https://arxiv.org/abs/2406.07318</link>
  <pubDate>Tue, 09 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2406.07318v3 Announce Type: replace-cross Abstract: The utilisation of event cameras represents an important and swiftly evolving trend aimed at addressing the constraints of traditional video systems. Particularly within the automotive domain, these cameras find significant relevance for their integration into embedded real-time systems due to lower latency and power consumption. One effective approach to ensure the necessary throughput and latency for event processing is through the utilisation of graph convolutional networks (GCNs). In this study, we introduce a custom EFGCN (Event-based FPGA-accelerated Graph Convolutional Network) designed with a series of hardware-aware optimisations tailored for PointNetConv, a graph convolution designed for point cloud processing. The proposed techniques result in up to 100-fold reduction in model size compared to Asynchronous Event-based GNN (AEGNN), one of the most recent works in the field, with a relatively small decrease in accuracy (2.9% for the N-Caltech101 classification task, 2.2% for the N-Cars classification task), thus following the TinyML trend. We implemented EFGCN on a ZCU104 SoC FPGA platform without any off-chip external memory resources, achieving a throughput of 13.3 million events per second (MEPS) and real-time partially asynchronous processing with low latency. Across multiple event-based classification benchmarks, our approach achieves competitive accuracy while providing state-of-the-art computational efficiency per event, small model size, and high scalability, customisability and resource efficiency. We publish both software and hardware source code in an open repository: https://github.com/vision-agh/gcnn-dvs-fpga.</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>A Camera-Native Talking-Head Video Dataset for Various Computer Vision Tasks</title>
  <link>https://arxiv.org/abs/2603.26763</link>
  <pubDate>Tue, 09 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2603.26763v2 Announce Type: replace-cross Abstract: Talking-head videos constitute a predominant content type in real-time communication, yet publicly available datasets for video processing research in this domain remain scarce and limited in signal fidelity. In this paper, we open-source a camera-native dataset of 847 talking-head recordings (approximately 212 minutes), each 15s in duration, captured from 805 participants using 446 unique consumer webcam devices in their natural environments. All recordings are stored using the FFV1 lossless codec, preserving the camera-native signal -- uncompressed (24.4%) or MJPEG-encoded (75.6%) -- without additional lossy processing. Each recording is annotated with a Mean Opinion Score (MOS) and ten perceptual quality tokens that jointly explain 64.4% of the MOS variance. From this corpus, we curate a stratified benchmarking subset of 120 clips in three content conditions: original, background blur, and background replacement. Codec efficiency evaluation across four datasets and four codecs, namely H.264, H.265, H.266, and AV1, yields VMAF BD-rate savings up to $-71.3\%$ (H.266) relative to H.264, with significant encoder$\times$dataset ($\eta_p^2 = .112$) and encoder$\times$content condition ($\eta_p^2 = .149$) interactions, demonstrating that both content type and background processing affect compression efficiency. A preliminary super-resolution evaluation with four SR models confirms that the dataset significantly affects absolute performance while preserving model rankings, demonstrating applicability beyond codec benchmarking. The dataset offers 5$\times$ the scale of the largest prior talking-head webcam dataset (847 vs. 160 clips) with lossless signal fidelity, establishing a resource for benchmarking video compression, super-resolution, quality assessment, and enhancement models in real-time communication.</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>A Vision-language Framework for Comparative Reasoning in Radiology</title>
  <link>https://arxiv.org/abs/2606.06407</link>
  <pubDate>Tue, 09 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.06407v2 Announce Type: replace-cross Abstract: Medical imaging artificial intelligence has achieved strong performance in isolated image interpretation, but remains poorly aligned with radiological practice, where diagnosis and follow-up rely on comparison across prior studies and analogous reference cases. Here we formulate radiological comparison as an entity-aware cross-image reasoning problem and introduce a framework that supports both reference-case retrieval and temporal comparative interpretation. We construct MedReCo-DB, a large-scale comparative imaging resource derived from routine image-report pairs, comprising more than 690,000 images from over 160,000 patients across eight institutions, four countries and seven imaging modalities. Reports are decomposed into anatomical structures, abnormal findings and pathological conditions to provide supervision for entity-conditioned retrieval and comparative visual question answering. Using this resource, we develop MedReCo, an entity-aware visual encoder for controllable retrieval of clinically analogous cases, and MedReCo-VLM, a vision--language extension for generative interpretation of interval change. Across internal, external and cross-center evaluations, MedReCo achieved the highest Recall@1 in all 12 internal retrieval settings and improved external retrieval by a mean of 6.0 percentage points. In clinically confusable differential groups, it consistently outperformed the strongest baselines. MedReCo-VLM achieved the best performance across all comparative generation evaluations and improved longitudinal follow-up accuracy by 14.5-46.5 percentage points on chest radiographs and 13.0-27.9 percentage points on CT. These findings suggest that entity-aware comparative reasoning can be learned from routine clinical data at scale and may provide a more clinically aligned foundation for medical imaging AI.</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>Hessian-matching Based Weighting for Attitude Determination Using Short-Range DoA Measurements with IMU Assistance</title>
  <link>https://arxiv.org/abs/2606.07719</link>
  <pubDate>Tue, 09 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.07719v1 Announce Type: new Abstract: Accurate and reliable attitude determination (AD) is essential for unmanned vehicles operating in Global Navigation Satellite System (GNSS)-denied environments. Short-range wireless arrays can provide direction-of-arrival (DoA) measurements from multiple anchors, enabling AD by aligning corresponding direction vectors (DVs) expressed in the body and navigation frames. In short-range scenarios, navigation-frame DVs inherit non-negligible uncertainty induced by anchor/vehicle position errors in addition to DoA-induced errors in body-frame DVs. Moreover, due to projection and unit-norm normalization, the DV errors are generally anisotropic, which motivates a total least squares (TLS) viewpoint. This paper identifies the key modeling distinction in short-range AD, develops a TLS-consistent formulation based on the total DV error and solves the resulting covariance-weighted orthogonal Procrustes problem via a manifold Gauss--Newton method. To retain the efficiency and numerical robustness of the closed-form weighted Wahba solution, we further propose Hessian-matching based scalar weighting strategies that approximate the Hessian of Wahba formulation to the TLS formulation, including a full-attitude strategy for overall accuracy and a direction-of-interest (DOI) strategy for prioritizing a selected attitude component. Finally, we incorporate IMU-derived gravity as an additional DV pair for static initialization, leading to extended Wahba and extended TLS formulations. Simulation results demonstrate that the proposed Hessian-matching weighting improves accuracy and robustness compared with existing baselines, and that gravity-DV augmentation further reduces attitude errors and improves solution availability under limited anchor availability.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Optimal Wiener-Filter Solutions for Denoising of Graph Signals on Directed Graphs</title>
  <link>https://arxiv.org/abs/2606.07876</link>
  <pubDate>Tue, 09 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.07876v1 Announce Type: new Abstract: Graph signal processing has opened new avenues to the canonical denoising problem in interesting settings. Specifically, here we propose a Wiener-filter solution for graph signals on directed graphs. Under various stationarity assumptions combining uncorrelated and correlated noise conditions, we show optimal solutions, including a successful proof-of-concept for a temperature graph.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>A Double Proportionate Sparse Adaptive Filter for Impulsive Noise Environments</title>
  <link>https://arxiv.org/abs/2606.08225</link>
  <pubDate>Tue, 09 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.08225v1 Announce Type: new Abstract: Sparse adaptive filters and impulsive noise robust algorithms have largely been developed along separate tracks, leaving a gap when both properties are needed simultaneously. This letter proposes the double proportionate sparse adaptive filter (DP-SAF), which closes this gap within a single $\mathcal{O}(M)$ update. Two independent diagonal gain matrices are introduced; one scales the adaptation step proportionately to coefficient magnitudes, and the other applies a magnitude-dependent zero-attraction that is strongest for inactive taps. A sign-error update provides robustness against impulsive corruptions. Both gain matrices are derived from a minimum-norm optimization framework. Simulations under a Bernoulli impulsive noise model show that DP-SAF consistently achieves a better steady-state MSD than the competing algorithms while matching or exceeding their convergence speeds.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>CG-MambaNet: A spatiotemporal framework for cross-patient epileptic seizure prediction using CNN-GCN-Mamba-BiLSTM with event-level clinical evaluation</title>
  <link>https://arxiv.org/abs/2606.08226</link>
  <pubDate>Tue, 09 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.08226v1 Announce Type: new Abstract: Epileptic seizure prediction from scalp EEG is critical for closed-loop neurostimulation therapy. Existing deep-learning methods share two architectural limitations: they model EEG channels independently, neglecting inter-channel spatial synchrony, and process raw time-domain samples without frequency decomposition. A methodological limitation also affects the field: most studies use data splits that permit patient-level information leakage, yielding optimistic estimates that do not generalise to unseen patients. We present CG-MambaNet, a spatiotemporal seizure prediction framework addressing all three limitations. A depthwise separable CNN front-end decomposes each EEG patch into multi-scale spectro-temporal features, capturing delta-to-gamma band dynamics before sequence modelling. A two-layer graph convolutional network with a learnable adjacency matrix captures inter-channel functional synchrony without montage-specific coordinates, applicable to bipolar (CHB-MIT) and referential (SIENA) montages. A bidirectional Mamba encoder followed by a bidirectional LSTM models long- and short-range temporal dynamics, and a two-layer MLP produces the final seizure probability. This serial hierarchy ensures frequency decomposition precedes spatial mixing, which precedes temporal integration. Under strict leave-one-patient-out cross-validation with five independent random seeds, CG-MambaNet achieves AUC-ROC of 0.8152+/-0.0176 on CHB-MIT (n=22) and 0.7104+/-0.0261 on SIENA (n=6), surpassing all published cross-patient methods without domain adaptation. An event-level evaluation framework merging consecutive alarmed windows via a persistence filter reduces false predictions to 0.32 alarms/hour on CHB-MIT, demonstrating clinically meaningful alarm burden.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>A dual-system approach for epilepsy diagnosis: integrating mamba-Bi-LSTM architecture with SHAP-based verification</title>
  <link>https://arxiv.org/abs/2606.08240</link>
  <pubDate>Tue, 09 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.08240v1 Announce Type: new Abstract: This study develops a medical AI-assisted diagnosis system based on deep learning, which provides intelligent diagnostic solutions for epilepsy, a disease that seriously threatens the life and health of patients. Epileptic seizures are sudden and unpredictable. Traditional diagnostic methods mainly rely on doctors&#39; manual interpretation of EEG, which is time-consuming and dependent on experience. In response to the above challenges, this study designed a dual-system intelligent diagnosis framework, which includes two core components: the main discrimination system and the verification system. The main discrimination system uses a deep learning model that combines the innovative Mamba architecture with the Bi-LSTM structure to integrate and analyze heterogeneous data to achieve extremely high diagnostic accuracy; the verification system provides an explainable diagnostic basis through the SHAP method to enhance the credibility of the results. This system establishes a cross-modal database to realize intelligent analysis of multi-source heterogeneous data, fusing EEG signals and clinical text data for epilepsy. The system outputs results based on diagnostic consistency and confidence levels, and high-confidence predictions can also be used as automatic feedback sources to optimize the model. The experimental results show that the accuracy of the main discriminant model of the intelligent diagnosis system for epilepsy has increased from 92.6% to 98.7% and the F1 score has increased from 0.895 to 0.992, all of which have exceeded the existing optimal methods; the average processing time for verification system feedback integration is only 220 ms, which increases the overall diagnostic accuracy by 5.1%.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Spatio-Sequential Recurrent Network for 3-D Tunnel Propagation Modeling</title>
  <link>https://arxiv.org/abs/2606.08246</link>
  <pubDate>Tue, 09 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.08246v1 Announce Type: new Abstract: Fine-mesh parabolic wave equation (PWE) simulations are high-fidelity but time-consuming, which limits real-time tunnel propagation analysis and motivates coarse-to-fine reconstruction. Existing machine learning (ML)-assisted tunnel models typically provide only one-dimensional (1-D) longitudinal refinement or two-dimensional (2-D) cross-sectional refinement, rather than joint 3-D enhancement. Motivated by this gap, this letter proposes a U-shaped gated spatio-sequential recurrent neural network (UG-SSRNN), a spatio-sequential reconstruction model for tunnel electromagnetic fields. UG-SSRNN jointly super-resolves transverse slices and models longitudinal evolution. It uses sliding-window context encoding and a K-layer convolutional recurrent backbone with a shared propagation-context state and diagonal feedback. A prediction-aware upsampling head leverages the previous prediction to improve slice-to-slice consistency. Experiments on four tunnel cross sections, unseen-material and unseen-frequency tests, and validation in the Massif Central tunnel show close agreement with fine-mesh PWE references. The proposed approach significantly reduces tunnel electromagnetic modeling time.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Feedforward Nonlinear Equalizer for Short- to Medium-Reach Wireline Links</title>
  <link>https://arxiv.org/abs/2606.08313</link>
  <pubDate>Tue, 09 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.08313v1 Announce Type: new Abstract: This paper presents a feedforward nonlinear equalizer (FFNE) framework for short- to medium-reach wireline links that removes the feedback-timing bottleneck of decision-feedback equalizers (DFEs) while approaching the noise-margin advantage within a characterized operating region. The proposed FFNE reduces short-window maximum-likelihood sequence estimation to a compact binary decision rule, enabling a low-complexity feedforward realization without transmitter-side encoding. For the single-postcursor NRZ case, the mathematical foundation, hardware implementation, tap adaptation, statistical analysis, and equalization limit relative to an ideal 1-tap DFE are established. A window-length-3 FFNE quantifies the performance-complexity tradeoff of longer sequence windows. The framework is further extended to PAM-4 modulation and simultaneous precursor/postcursor equalization through a pattern-detection-based FFNE (PD-FFNE), which outperforms conventional FFE+DFE baselines under representative channel conditions.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Enhanced Wide-Angle Steering with Multi-Mode Multi-Port Aperture Antenna Arrays</title>
  <link>https://arxiv.org/abs/2606.08320</link>
  <pubDate>Tue, 09 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.08320v1 Announce Type: new Abstract: A novel concept for wide-angle scanning is proposed based on multi-mode multi-port antennas. The theory of multi-mode multi-port antennas based on aperture radiators is developed and applied towards the design of an antenna array consisting of multi-mode aperture radiators. An advanced beamforming algorithm is developed and implemented, making use of the higher degrees of freedom available to multi-mode multi-port antennas. The manufactured antenna array is measured and compared to the expected performance. Wide-angle steering up to $\pm77^\circ$ from broadside with respect to a scan loss of $3\,\mathrm{dB}$ is achieved in both the horizontal and vertical plane with no visible grating lobe.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>A Switching Beamformer for Highly Non-Stationary Environments</title>
  <link>https://arxiv.org/abs/2606.08385</link>
  <pubDate>Tue, 09 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.08385v1 Announce Type: new Abstract: Adaptive beamforming is a cornerstone of array signal processing, yet its performance often collapses in the face of complex, rapidly changing interference. When interferers appear or move unpredictably, conventional estimators encounter a fundamental memory trade-off: short windows enable rapid tracking but suffer from high estimation variance, while long windows provide stable rejection but fail to adapt to shifts. This challenge is resolved by introducing the Universal Switching Beamformer (USB), which integrates competitive sequential prediction into the beamforming architecture. By employing a linear transition diagram, the USB implicitly maintains an exponentially large family of candidate covariance histories and dynamically re-weights them based on their cumulative output power. This mechanism allows the beamformer to automatically vary its effective memory length without explicit change detection or heuristic parameter tuning. A theoretical upper bound is proven on the regret relative to an omniscient oracle that selects the best piecewise-stationary covariance model in hindsight. Extensive simulations and experiments on the SwellEx-96 dataset demonstrate that the USB achieves the agility of short-window estimators and the precision of long-term integration, providing a principled solution for tracking highly non-stationary scenes.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Mixture-of-Experts Transformer for Automatic Modulation Recognition</title>
  <link>https://arxiv.org/abs/2606.09085</link>
  <pubDate>Tue, 09 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.09085v1 Announce Type: new Abstract: Automatic Modulation Recognition (AMR) is a key enabling technology for cognitive radio and intelligent spectrum management in next-generation wireless systems. However, current deep learning-based AMR methods predominantly rely on static multi-scale fusion strategies, which lack the flexibility to adapt to the highly dynamic temporal variations of modulation signals. To address this limitation, we propose MoEformer, an adaptive Multi-Scale Mixture-of-Experts Transformer network that directly processes I/Q signals to preserve their temporal and phase structures. Specifically, MoEformer constructs multi-scale expert views through temporal resampling, employs an input-dependent gating mechanism for dynamic expert fusion, and integrates Rotary Position Embeddings (RoPE) within Transformer encoders to capture both local and global temporal dependencies. Comprehensive evaluations on three widely adopted benchmarks (RadioML2016.10a, RadioML2016.10b, and RadioML2018.01A) demonstrate that MoEformer outperforms the competitive baselines, achieving superior average recognition accuracies of 63.74%, 66.24%, and 64.22%, respectively. In addition, the proposed method strikes an optimal trade-off between recognition performance and model complexity.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>RSMA Technique for Multi-User Downlink Single-Waveguide Multi-Pinching Antenna Systems</title>
  <link>https://arxiv.org/abs/2606.09095</link>
  <pubDate>Tue, 09 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.09095v1 Announce Type: new Abstract: Pinching antennas have recently emerged as a promising technology for reconfigurable wireless systems due to their ability to dynamically radiate signals from flexible positions along a waveguide. This letter investigates a multi-user communication framework by integrating rate-splitting multiple access (RSMA) into a single-input single-output (SISO) single-waveguide architecture equipped with multiple pinching antennas. Multiple antennas are activated along a shared waveguide to radiate a common guided signal toward distributed users, enabling strong near-field line-of-sight (LoS) links with low hardware complexity and a single radiofrequency (RF) chain. To manage multi-user interference, RSMA is employed within the proposed architecture. Simulation results show that the proposed framework improves system sum-rate, enhances user rate fairness, and achieves lower bit error rate (BER) while preserving the low-cost and scalable characteristics of pinching antenna systems (PASS).</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Towards Intelligent Wireless Networks: The Synergy of Generative AI and Digital Twins</title>
  <link>https://arxiv.org/abs/2606.09113</link>
  <pubDate>Tue, 09 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.09113v1 Announce Type: new Abstract: This paper proposes a generative AI (GenAI)-enabled digital twin (DT) framework for proactive and energy-aware wireless optimization in future 6G ecosystems. Most existing AI-assisted DT approaches remain fundamentally reactive, adjusting network parameters only after performance degradation occurs or restricting GenAI to isolated signal-level tasks such as channel estimation. This work adopts a proactive approach. Instead of responding to problems after they appear, the proposed framework continuously synchronizes channel states, mobility dynamics, traffic conditions, and energy information within a real-time DT environment, enabling the system to anticipate congestion, interference, and energy demand before they materialize. The result is a closed-loop proactive architecture that operates at the system level, jointly managing communication, mobility, and resource dynamics for autonomous wireless control. Evaluations on a UAV-assisted non-terrestrial network (NTN) scenario show approximately 69.2% energy savings over reactive baselines while maintaining reliable quality-of-service (QoS) under dense and mobility-intensive conditions. Beyond this specific scenario, the framework offers a scalable foundation for broader AI-native 6G applications, including aerial platforms, autonomous systems, extended reality (XR), industrial automation, and space-air-ground-sea (SAGS) integrated infrastructures.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Block-Term Decomposition Approach to Blind Multi-trial Functional Ultrasound Unmixing</title>
  <link>https://arxiv.org/abs/2606.09264</link>
  <pubDate>Tue, 09 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.09264v1 Announce Type: new Abstract: Functional ultrasound (fUS) has emerged as a powerful neuroimaging modality due to its high resolution in both space and time, low cost and potential portability. Nevertheless, fUS signals provide only indirect observations of neuronal activity through the neurovascular coupling, and hence require the blind separation of latent neuronal sources while also deconvolving their hemodynamic responses. In this work, we propose a data-driven convolutive block-term tensor decomposition-based model for multi-trial fUS measurements, where each source has a spatiotemporal representation comprising a low-rank spatial map and a piecewise-constant neuronal activation signal convolved with a trial- and source-dependent hemodynamic response function (HRF) with a physiologically plausible shape. We propose a constrained optimization framework for the model computation, which consists of alternating projected gradient descent iterations. Simulation results are reported that demonstrate accurate recovery of spatial maps and reliable estimation of activation temporal profiles across various noise levels, while confirming that HRF estimation remains the most challenging part of the problem.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Wearable Single-Lead ECG Detects Fine-Grained Structural Heart Disease Through Echo-Report Supervision</title>
  <link>https://arxiv.org/abs/2606.09332</link>
  <pubDate>Tue, 09 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.09332v1 Announce Type: new Abstract: Structural heart disease (SHD) is a primary driver of heart failure and cardiovascular mortality, yet early detection remains constrained by the limited accessibility of echocardiography. While single-lead electrocardiogram (ECG) is ubiquitous through wearables, existing AI screening models often depend on 12-lead inputs, generalize poorly across institutions, or require massive, condition-specific labeled datasets. Recent work has demonstrated the feasibility of contrastive pre-training between single-lead ECGs and echocardiography reports within a single health system. Here, we present AnyECG-Echo, a framework that advances this paradigm toward clinical translation through three key developments: (1) evaluation in a geographically independent external cohort (n = 16,621); (2) diagnostic coverage of 13 fine-grained SHD subtypes spanning myocardial, chamber, valvular, and great-vessel pathologies; and (3) dual-axis mechanistic interpretability combining electrophysiology-grounded Shapley attribution with emergent correlations to quantitative measurements. Across validation cohorts totaling n = 25,222, the model demonstrated high AUROC for high-impact subtypes, including reduced left ventricular systolic function (AUROC 0.866-0.924), global heart enlargement (0.877-0.931), and mitral stenosis (0.836-0.906). Furthermore, we successfully validated the alignment of model outputs with established medical physiological traits, thereby enhancing interpretability. Notably, we discovered that AnyECG-Echo&#39;s outputs function as physiologically grounded digital biomarkers that accurately track objective metrics such as LVEF and myocardial wall thickness. These findings prove that wearable single-lead ECGs can effectively detect fine-grained structural heart disease, offering a practical solution for population-scale screening.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Orbital Plane Geometry and Information Conditioning for Doppler-Only LEO Positioning</title>
  <link>https://arxiv.org/abs/2606.09496</link>
  <pubDate>Tue, 09 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.09496v1 Announce Type: new Abstract: We study an idealized information model for Doppler-only positioning with low earth orbit (LEO) signals of opportunity from a stationary receiver. Motivated by the observation that Doppler measurements from a satellite pass provide information primarily within the associated orbital plane, we model each satellite contribution as a weighted projection onto that plane. Under this model, the combined information matrix from multiple satellites is a sum of orbital-plane projection operators. Closed-form expressions are derived for the eigenvalues, condition number, and worst-case Cramer-Rao lower bound. For two satellites, the conditioning is governed by the dihedral angle between orbital planes and the relative information strengths of the two links. Monte Carlo evaluation of pass-integrated Doppler Fisher information matrices demonstrates that the proposed surrogate captures the dominant conditioning trends associated with orbital-plane diversity. The results provide a simple geometric framework for understanding the role of constellation geometry in Doppler-only positioning systems.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Hierarchical Federated Learning for Unsupervised Waveform Classification over Tactical MANETs</title>
  <link>https://arxiv.org/abs/2606.09504</link>
  <pubDate>Tue, 09 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.09504v1 Announce Type: new Abstract: Distributed radio frequency sensing in contested tactical environments demands collaborative learning across mobile nodes. In ad-hoc networks, learning must occur without persistent backhaul, ground truth labels, or reliable communication links. Traditional federated learning approaches assume either ideal link conditions or supervised training objectives, neither of which holds in practice for deployed MANET platforms. This paper presents a hierarchical federated learning framework for unsupervised waveform classification over tactical MANETs subject to Rayleigh fading, random waypoint mobility, and multi-hop routing loss. Each node trains a local denoising convolutional autoencoder on raw IQ observations without label exchange, learning compact representations through a self-supervised reconstruction objective. A two-stage aggregation protocol elects connectivity-based relay aggregators consistent with OLSR multipoint relay selection, compressing cluster-level model updates before forwarding to a mobile server proxy. Simulation results demonstrate that in-network aggregation reduces attempted transmission bits relative to relay-forward federated averaging by around 12% at equivalent classification performance. Notably, stochastic channel-driven subsampling under non-IID data acts as an implicit regularizer, with both MANET conditions matching or exceeding ideal federated averaging on unsupervised representation quality. This suggests that moderate link loss can partially compensate for client drift in heterogeneous networks. Performance is assessed on analysis of the learned latent embeddings using KMeans normalized mutual information and linear probe accuracy.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Bernoulli Filtering for Multi-Sensor Tracking with Thresholded Measurements</title>
  <link>https://arxiv.org/abs/2606.09573</link>
  <pubDate>Tue, 09 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.09573v1 Announce Type: new Abstract: Target tracking is challenging when sensor detection thresholds cause state-dependent missed detections, particularly in multi-sensor scenarios with clutter and uncertain target existence. A recently developed missed detection framework models detection probability as a function of target state, sensor characteristics, and detection threshold, but it is limited to individual measurements and does not address the recursive tracking problem. This work extends the framework using a Bernoulli filter formulation to jointly handle recursive target tracking, clutter, and target existence uncertainty. A Bernoulli particle filter is evaluated in a simulated 2D multi-sensor tracking scenario with nonlinear measurements, clutter, and detection uncertainty. Incorporating accurate detection threshold knowledge reduces the generalized optimal subpattern assignment (GOSPA) metric by 62.4% compared to a conventional Bernoulli filter with fixed detection probability, while better balancing missed detections and false alarms.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Throughput Analysis for Near-Field Mobile Communications: Beamfocusing or Caustic Beamforming?</title>
  <link>https://arxiv.org/abs/2606.09652</link>
  <pubDate>Tue, 09 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.09652v1 Announce Type: new Abstract: The migration to the Terahertz (THz) band and the deployment of extremely large antenna arrays (ELAAs) are transitioning wireless communications into the radiative near-field regime, fundamentally evolving conventional angular beam steering to beamfocusing (BF). However, the combination of the extremely narrow beamwidth and the mobility of the users necessitates frequent beamfocusing reconfigurations, incurring a significant switching overhead that degrades the system achievable throughput. In this regard, caustic beamforming (CB) is a promising alternative based on the synthesis of a continuous curved beam, which eliminates the need for beam tracking at the expense of a distributed beamforming gain. By leveraging the Airy beam as a canonical model, this paper develops an analytical framework to compare the throughputs achieved by CB and BF. Our main results include closed-form throughput expressions for both beamforming strategies and a performance boundary for paradigm selection. First, we derive the BF throughput by modeling a defocusing penalty induced by continuous user movement. The optimal beam dwell time that maximizes the throughput is analytically determined, and the impact of user speed and switching overhead on the throughput is quantified. For the CB scheme, we demonstrate that its throughput is determined by the signal-to-noise ratio (SNR) and the geometry of the trajectory of the user, yet invariant to the user speed. Finally, we analytically establish a threshold for the switching overhead to define the crossover point of the achievable throughput of both beamformers. Crucially, this threshold asymptotically vanishes at extremely high frequencies, positioning the continuous CB scheme as the preferred beam design paradigm for high-mobility THz communications.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Jamming-Resilient Sparse Delay-Doppler NOMA: Unitary Precoding, Randomized Active Sets, and Superincreasing Power Allocation</title>
  <link>https://arxiv.org/abs/2606.09753</link>
  <pubDate>Tue, 09 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.09753v1 Announce Type: new Abstract: We propose a sparse delay-Doppler NOMA scheme resilient to intentional jamming. The transmitter places user data on a small random subset of delay-Doppler bins, spreads the result through a unitary precoder, and re-draws the active subset per frame from a pseudo-random seed shared with the receiver. The receiver detects and discards jammed bins, recovers the sparse signal by least squares, and decodes per bin via SIC. Hadamard, DFT, and Haar-random precoders all yield essentially the same BER, because a Marchenko-Pastur conditioning argument controls any random unitary submatrix. The closed-form BER has no jammer-induced floor, unlike the well-known partial-band floor of conventional OTFS-NOMA. The same argument shows that compromising the shared seed does not break the system: random unitary submatrices remain well-conditioned, so BER stays within the unjammed envelope. For more than two users we use a superincreasing power allocation (Merkle-Hellman knapsack) and prove the resulting low-complexity SIC matches maximum-likelihood detection exactly, removing the usual SIC propagation ceiling. For more than four users we partition them into pairs assigned to disjoint bin subsets; this OMA-friendly NOMA rule reaches floor BER at eight users by SNR around 20 dB. We extend the framework to Rician fading and show the jammer-independence property holds for arbitrary Rician K-factor. Monte Carlo simulations track the analytical predictions within 3 dB and show at least a 40 dB BER-ratio improvement against pattern-aware jammers, with about 24 dB of cumulative gain over conventional OTFS-NOMA under oracle jamming.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Adaptive Derivative Estimation via Stein&#39;s Unbiased Risk</title>
  <link>https://arxiv.org/abs/2606.09829</link>
  <pubDate>Tue, 09 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.09829v1 Announce Type: new Abstract: Estimating derivatives from noisy sampled data is fundamental to control, human-computer interaction, and biomedical engineering. Causal FIR derivative filters offer a natural approach for this challenge, yet their performance depends on their length. While short filters amplify noise, long filters introduce smoothing bias. We present SURDE (SURE Derivative Estimator), which addresses this tradeoff at each time step by evaluating a data-driven cost derived from Stein&#39;s Unbiased Risk Estimator (SURE) across a bank of candidate lengths and soft-combining their outputs via exponential weighting. We prove a minimax-optimal oracle inequality for the soft-combined estimator and use it to derive the optimal weighting temperature in closed form. Thus, the only tuning parameter for SURDE is the noise variance. Via numerical simulations we show that SURDE consistently outperforms alternative adaptive methods (the Intersection of Confidence Intervals (ICI) rule and the Adaptive Windowing Velocity Estimator (AWVE)) for first-derivative estimation. We further show that SURDE is robust to noise-variance misspecification (9% degradation over a $4\times$ range), and that it is also superior to ICI and AWVE in real-data scenarios (the EuRoC MAV dataset). SURDE is causal, computationally light, and requires only a rough estimate of the noise variance.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>NL-COMM-Sat: Breaking the Direct Device-to-Satellite Communication Barrier via &quot;Aggressive&quot; Non-Orthogonal Transmissions and Non-Linear Processing</title>
  <link>https://arxiv.org/abs/2604.24453</link>
  <pubDate>Mon, 08 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2604.24453v3 Announce Type: replace Abstract: Direct Device-to-Satellite (D2S) communications, which enable direct satellite connectivity with unmodified user equipment (UE), not only expand global coverage but also reshape the evolution of future access networks. However, D2S links face fundamental challenges due to inherently low signal-to-noise ratios (SNRs) and limited spatial multiplexing gains arising from near line-of-sight propagation, both of which severely constrain achievable spectral efficiency. Despite the lack of spatial multiplexing, this work shows that aggressive non-orthogonal transmissions, where multiple users (e.g., four) transmit concurrently over the same frequency resources, even to a single receive antenna, can unlock substantial capacity gains that remain entirely unexploited by existing systems. Realizing these gains in practice, however, requires receiver architectures that, to the best of our knowledge, have not yet been developed. To this end, we introduce NL-COMM-Sat, an efficient and flexible framework that overcomes this limitation by enabling aggressive non-orthogonal signal transmissions. In contrast to conventional non-orthogonal multiple access (NOMA) schemes, NL-COMM-Sat supports more than two UEs per receive antenna on the same frequency resource. The framework revisits optimal receiver design principles and proposes computationally efficient processing schemes that translate previously unexplored theoretical gains into tangible throughput improvements, even under realistic channel estimation errors and high-mobility Doppler conditions. Our evaluation shows that NL-COMM-Sat achieves up to a 2x increase in spectral efficiency compared to orthogonal multiple access and NOMA baselines across all considered SNR and Doppler regimes, even with a single-antenna receiver and user speeds of up to 500 km/h.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>ORIX: Orchestration of RIS with xApps for Smart Wireless Factory Environments</title>
  <link>https://arxiv.org/abs/2510.17462</link>
  <pubDate>Mon, 08 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2510.17462v2 Announce Type: replace Abstract: The vision of a smart wireless factory (SWF) demands highly flexible, low-latency, and reliable connectivity that goes beyond conventional wireless solutions. Reconfigurable intelligent surface (RIS)-empowered communications, when integrated with the open radio access network (O-RAN) architectures, have emerged as a promising enabler to meet these challenging requirements. This article introduces the methodology for the orchestration of RIS with xApps (ORIX), bringing the RIS technology into the O-RAN ecosystem through xApp-based control for SWF environments. ORIX features three key components: an O-RAN-compliant RIS service model for dynamic configuration, an RIS channel simulator that supports 3GPP indoor factory models with multiple industrial scenarios, and practical RIS optimization strategies with finite-resolution control. Together, these elements provide a realistic end-to-end emulation platform for evaluating RIS placement, control, and performance in SWF environments prior to deployment. The presented case study demonstrates how ORIX enables the evaluation of achievable performance gains, exploration of trade-offs among key RIS design parameters, and identification of deployment strategies that balance system performance with practical implementation constraints. By bridging theoretical advances with industrial feasibility, ORIX lays the groundwork for RIS-assisted O-RAN networks to power next-generation wireless communication in industrial scenarios.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Time series Foundation Models based on Physics-Informed Synthetic Histories for Cold-Start Photovoltaic Forecasting</title>
  <link>https://arxiv.org/abs/2606.07457</link>
  <pubDate>Mon, 08 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.07457v1 Announce Type: cross Abstract: At commissioning time, Photovoltaic (PV) operators must forecast production before target-site observations are available, limiting the direct use of standard supervised forecasters. This cold-start setting is addressed with a zero-shot pipeline that generates a synthetic production history from plant metadata and meteorological covariates, enabling time-series foundation models (TSFMs) to forecast through inference-time conditioning. Five TSFMs are benchmarked against classical baselines under strict Cold-Start Baseline, Real Feedback, and Self-Forecast Feedback strategies. The evaluation spans 440 PV sites across four datasets and diverse climate regimes. Covariate-aware foundation models outperform baselines by approximately 1.7-2x: TabPFN-TS achieves the lowest error under Real Feedback (MAE 0.514, RMSE 0.721 kWh kWp^-1 d^-1), while Chronos-2 is most robust under Self-Forecast Feedback. Performance is largely insensitive to the synthetic-history source, indicating that accuracy is driven more by the availability of plausible temporal context than by the specific generator.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Degrees of Freedom of Over-the-Air Computation over a MIMO Gaussian Network with Two Transmitters and Two Receivers</title>
  <link>https://arxiv.org/abs/2606.06770</link>
  <pubDate>Mon, 08 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.06770v1 Announce Type: cross Abstract: The fundamental limits of over-the-air computation (AirComp) are explored in a two-transmitter, two-receiver MIMO Gaussian network, where both receivers demand the same aggregation of source symbols originating at the two transmitters. An AirComp degrees of freedom (ACDoF) metric is defined, constrained by an asymptotic mean-squared error threshold. For a generic MIMO setting where the two transmitters are equipped with $M_1, M_2$ antennas, and the two receivers with $N_1, N_2$ antennas, the AirComp DoF value is shown to be almost surely equal to $\min\{M_1,M_2,N_1,N_2,(1/3)\max\{M_1+M_2,N_1+N_2\}\}$. For SISO settings, results are extended beyond generic channels to arbitrary channel realizations. For finite signal-to-noise ratio (SNR) settings, an iterative alternating optimization algorithm is explored.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Multiscale POD of Transformer Attention Fields: Scale-Selective Analysis via Morlet Scalogram</title>
  <link>https://arxiv.org/abs/2606.06573</link>
  <pubDate>Mon, 08 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.06573v1 Announce Type: cross Abstract: We introduce scale-selective Proper Orthogonal Decomposition (POD) for transformer attention fields, inspired by the use of POD for extracting energetically dominant modes from turbulent flow ensembles. The Morlet continuous wavelet transform identifies dominant temporal scales in the attention lag structure across a document ensemble; POD then extracts the energetically dominant modes at each scale from the ensemble of attention fields. The resulting modes reveal layer-dependent scale organisation, with early layers emphasising fine scales and later layers shifting toward coarser scales. We define a spectral concentration index from the POD eigenvalue decay rate and show empirically that it differentiates layers by their attention field complexity. By the classical POD optimality theorem, the extracted modes minimise the average L2 reconstruction error over the ensemble (Theorem 1), giving a data-driven effective rank for each layer. The method requires no architectural modification and no linguistic annotations: dominant attention patterns emerge from ensemble statistics alone. The turbulence analogy is structural rather than physical: we borrow ensemble covariance and modal analysis, not fluid dynamics itself.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Amortized Neural Optimization for Pre-Layout Signal Integrity Design Space Exploration using Differentiable Surrogates</title>
  <link>https://arxiv.org/abs/2606.07463</link>
  <pubDate>Mon, 08 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.07463v1 Announce Type: new Abstract: Pre-layout design space exploration (DSE) for high-speed signal integrity (SI) analysis is often limited by the computational cost of simulations and iterative optimization algorithms within modern electronic design automation (EDA) workflows. While machine learning surrogate models accelerate the simulation step, optimizing designs still requires utilizing iterative black-box search methods. This iterative nature scales poorly, making multi-corner sweeps computationally expensive. As a solution, this paper proposes amortized neural optimization (ANO) for pre-layout SI design. ANO entirely eliminates iterative black-box inference by utilizing fully differentiable neural network surrogate models. ANO extracts analytical gradients from the surrogate to train a global optimization policy. Instead of solving the optimization problem repeatedly at inference, the optimization process is learned offline and therefore amortized. Once the ANO policy is trained, it maps different channel contexts directly to near-optimal design parameters in a single deterministic forward pass. The efficiency and accuracy of the ANO framework are demonstrated based on three complex SI design scenarios, including DDR5 decision feedback equalization (DFE), 9-dimensional SerDes Tx/Rx co-equalization, and DDR3 DQS differential pair routing to optimize eye diagram metrics under intra-pair skew constraints. By trading roughly 10% in optimality compared to instance-specific black-box algorithms, it realizes speedups of three to four orders of magnitude. For a large-scale 320,000-instance multi-corner SerDes sweep optimization, ANO collapses what would have taken days of computation using iterative search algorithms into a single batched forward pass that completes in milliseconds. This transforms computationally expensive SI optimization into real-time and interactive pre-layout DSE.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Beyond Backscatter: InSAR coherence from detected SAR images</title>
  <link>https://arxiv.org/abs/2606.07374</link>
  <pubDate>Mon, 08 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.07374v1 Announce Type: new Abstract: In this work, we propose a deep learning framework for coherence regression directly from detected SAR images, without the need for accurate coregistration. A Residual U-Net is trained using coherence maps derived from precisely coregistered Sentinel-1 SLC data to learn the relationship between backscatter magnitudes and coherence. The model is trained on 12-day SLC pairs and evaluated across different datasets, including coregistered SLC products and open access analysis-ready data, covering diverse radiometric properties, geometries, and locations. Experimental results demonstrate that the proposed method achieves high-resolution coherence regression with improved accuracy compared to existing intensity-based approaches. The network generalizes well across diverse geographical locations and even across different temporal baselines that were never seen at training time. Additionally, the ability to operate on globally available analysis-ready data, such as ground range detected data, e.g., distributed through Google Earth Engine, enables its large-scale application in mission design, change monitoring, and diverse mapping tasks.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>CSI Phase Averaging for High-Sensitivity Wi-Fi Sensing in Low-Multipath Environments</title>
  <link>https://arxiv.org/abs/2606.07347</link>
  <pubDate>Mon, 08 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.07347v1 Announce Type: new Abstract: This paper presents a low-complexity motion detection method for outdoor Wi-Fi sensing based on a model-driven approach. The method exploits the structural characteristics of the phase components in channel state information (CSI) for low-multipath propagation environments, which are generally considered disadvantageous for Wi-Fi sensing, to mitigate the phase offset errors originating from wireless devices. In addition, phase averaging provides a processing gain that reduces the random noise components, including quantization and thermal noise. The theoretical basis of the method is described and its effectiveness is experimentally evaluated using Compressed Beamforming frames obtained from commercial IEEE 802.11ac devices. The experiments primarily focus on wild crows flying in an outdoor orchard environment. The experimental results demonstrate that the method can detect birds even when they fly several meters away from the direct line-of-sight path between the transmitter and receiver antennas. Furthermore, the results indicate that fluctuations caused by vegetation movement were negligible when the wind speed was less than 3 m/s. The proposed approach is expected to be applicable not only to orchard monitoring but also to other outdoor Wi-Fi sensing applications in low-multipath environments.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Implementation and Calibration of 3GPP-Compliant ISAC Channel Simulator</title>
  <link>https://arxiv.org/abs/2606.07328</link>
  <pubDate>Mon, 08 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.07328v1 Announce Type: new Abstract: Integrated sensing and communication (ISAC) has emerged as a key technology for 6G systems. To support the development of ISAC systems, accurate channel modeling and simulation for performance evaluation is essential. Recently, 3GPP introduced a standardized ISAC channel model and its associated calibration procedure for this purpose. However, due to the complexity of the modeling methodology and the lack of fully explicit implementation details in the 3GPP reports, different implementations may lead to inconsistent or unsynchronized simulation results. To address this issue, in this work, we implement the 3GPP ISAC channel model simulator specified in TR 38.901 and conduct a comprehensive calibration analysis. We compare the simulation results with the reference results reported by companies in 3GPP and discuss several key implementation details to provide insights into the implementation and calibration of the simulator. To facilitate reproducibility and further research, the developed simulator, together with the relevant datasets and calibration results, has been released as an open-source project on GitHub.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>RSMA Enabled Hierarchical UAV Networks with Non Linear Energy Harvesting: Outage Probability Analysis and UAV Placement Optimization</title>
  <link>https://arxiv.org/abs/2606.07284</link>
  <pubDate>Mon, 08 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.07284v1 Announce Type: new Abstract: Uncrewed aerial vehicles (UAVs) are expected to enhance connectivity, extend network coverage, and support advanced communication services in sixth-generation (6G) cellular networks, particularly in public and civil applications. Although multi-UAV systems offer greater efficiency and cost-effectiveness than single-UAV deployments, their implementation still faces several fundamental challenges that limit their reliability, sustainability, and scalability. The limited onboard energy restricts mission duration and communication continuity. Therefore, wireless energy harvesting (EH) emerges as a promising solution to overcome this limitation. However, terrestrial energy sources experience path loss, making EH from surrounding UAVs more sustainable. Moreover, rate-splitting multiple access (RSMA) remains insufficiently explored in hierarchical UAV networks under hardware impairments (HWI) and imperfect channel state information (ICSI). This paper proposes a hierarchical ad hoc UAV network with non-linear EH and RSMA to enhance both energy and cost efficiency, where UAVs harvest energy from surrounding UAVs. For a practical scenario, we consider the effect of HWI and ICSI in our proposed system. To the best of the authors&#39; knowledge, this study is the first to investigate such a scenario in the literature. The outage probability expressions for ground Internet of things (IoT) devices, each CMU, and the overall outage probability of the proposed system are derived over Nakagami-$m$ fading channels while considering practical constraints such as HWI, ICSI, and non-linear EH. Additionally, approximate outage probability expressions are derived for high transmit power regimes. Subsequently, we formulate two optimization problems to enhance reliability and performance. Our findings indicate that the proposed system outperforms all benchmarks in terms of outage probability.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Robust Secure Beamforming for Movable Antenna Enhanced Integrated Sensing and Communications</title>
  <link>https://arxiv.org/abs/2606.07104</link>
  <pubDate>Mon, 08 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.07104v1 Announce Type: new Abstract: In this letter, we investigate robust beamforming design for a movable antenna (MA)-enhanced secure integrated sensing and communications (ISAC) system with imperfect eavesdropping channel state information (CSI). To improve radar sensing performance, we formulate a radar signal-to-interference-plus-noise ratio (SINR) maximization problem by jointly optimizing the transmit beamforming and antenna placement while ensuring communication data security. However, the resulting optimization problem is inherently intractable due to the nonlinear mapping from antenna positions to channel coefficients, as well as the eavesdropper (Eve) channel uncertainty. To handle these challenges, we propose a block coordinate descent (BCD)-based algorithm incorporating successive convex approximation (SCA) and fractional programming (FP) techniques. Simulation results show that our proposed algorithm exhibits fast convergence and achieves a significant improvement in the radar SINR while guaranteeing communication security.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Rate-Splitting--Inspired Uplink Near-Field ISAC</title>
  <link>https://arxiv.org/abs/2606.07091</link>
  <pubDate>Mon, 08 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.07091v1 Announce Type: new Abstract: Integrated sensing and communication (ISAC) enables sensing and communication (S&amp;C) functionalities to share spectrum, hardware, and signal-processing resources, but the resulting inter-functionality interference creates a fundamental receiver-design challenge, particularly in uplink operation. This paper develops a rate-splitting (RS)-inspired framework for uplink near-field ISAC. The framework generalizes the sensing-centric (S-C) and communication-centric (C-C) endpoint orders of non-orthogonal multiple access (NOMA)-inspired ISAC by splitting the communication message across the sensing operation. Closed-form expressions are derived for the communication-rate (CR) and sensing-rate (SR), accounting for residual sensing interference from target-response estimation uncertainty. The achievable CR-SR rate region is characterized under sensing-matched illumination, where the proposed single-frame RS-inspired boundary contains the NOMA-inspired time-sharing region. Unlike the classical Gaussian uplink multiple access channel, where RS recovers the time-sharing dominant face, the split factor in uplink ISAC also reshapes the sensing-stage interference, allowing the RS-inspired boundary to match or strictly enlarge the S&amp;C tradeoff. High-SNR analysis shows that, for non-aligned S&amp;C channels, residual sensing interference changes the rate offsets but not the leading S&amp;C slopes, whereas in the fully-aligned case it becomes slope-limiting. Using an aperture-aware near-field channel model, large-array limits are derived, showing that achievable rates remain finite as the array grows. Numerical results validate the analysis and demonstrate the benefits of the RS-inspired scheme, the impact of residual sensing interference, and the bounded large-array behaviour induced by physically consistent near-field modelling.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Optimized Sampling of Angle-Resolved Scatterometry Data Using End-to-End Compressed Learning Model for Nanograss Deficiency Detection</title>
  <link>https://arxiv.org/abs/2606.07050</link>
  <pubDate>Mon, 08 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.07050v1 Announce Type: new Abstract: Reliable inspection of nanosurfaces is essential to ensure the quality of nanostructure manufacturing. Angle-resolved scatterometry provides a non-invasive inspection method that can be used in-line but often suffers from long acquisition times due to dense angular sampling. This paper addresses the data acquisition challenge by proposing an end-to-end compressed learning framework for 5-level vacancy deficiency detection in zinc oxide nanograss using ARS images. The proposed framework integrates a learnable latitude-based sampling layer with a convolutional neural network, allowing sampling and classification to be jointly optimized during training. The sampling layer exploits the physical structure of ARS patterns and learns informative latitudinal regions, which reduces the sampling search space and improves convergence. Evaluation results show that the proposed approach achieves high and stable deficiency-level classification performance under different noise conditions. Using full ARS images, the model achieves 94.2% accuracy for five-level deficiency classification and 98.6% accuracy for separating deficient from non-deficient nanosurfaces. The proposed sampling model matches full-image performance while using up to 90% fewer angular sampling points. Even when sampling points are reduced by 99.7%, the classification accuracy decreases by less than 10 percentage points. To further improve training with limited data, we also studied a GAN-based augmentation approach and used GAN-generated data for model pretraining. Augmented data resulted in fast convergence within only a few fine-tuning epochs.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>A Novel Stripe-based RIS Optimization for UAV Communications and Sensing in Low-Altitude Wireless Networks</title>
  <link>https://arxiv.org/abs/2606.07026</link>
  <pubDate>Mon, 08 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.07026v1 Announce Type: new Abstract: Low-altitude wireless networks (LAWN) envision a reconfigurable 3D network capable of supporting mission-critical aerial operations. This paper presents a reconfigurable intelligent surface (RIS)-assisted LAWN to establish reliable communication with an unmanned aerial vehicle (UAV) across varying wireless channel conditions and signal blockages. A low-complexity stripe-based RIS phase shift optimization framework is proposed to simultaneously enhance communication reliability and provide passive sensing capability for UAV tracking under 3D mobility. Unlike high-complexity optimization approaches, the proposed method leverages the inherent structural phase-gradient of the RIS adjacent elements to significantly reduce the search space for calculating and updating the RIS configuration as the UAV moves. The analysis and simulation results demonstrate that the proposed framework outperforms conventional benchmarks in convergence speed and computational efficiency, while maintaining robust, high signal-to-noise-ratio (SNR) connectivity even in the presence of phase estimation errors and low SNR regimes. In addition, measurement experiments using a real RIS prototype in an outdoor campus environment are performed to demonstrate the practical viability of the proposed approach.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Learn to Access and Backhaul the Sky: Multi-Scale Radio Map Guided Multi-UAV Cooperation</title>
  <link>https://arxiv.org/abs/2606.06954</link>
  <pubDate>Mon, 08 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.06954v1 Announce Type: new Abstract: Driven by the emerging low-altitude economy, uncrewed aerial vehicle (UAV) swarms offer flexible integrated air-ground access and backhaul. However, providing seamless connectivity is difficult due to the interdependent dynamics of user mobility and building blockages in these 3D scenarios. These factors create rapidly shifting bottlenecks in end-to-end paths. Furthermore, the multi-dimensional nature of joint control limits the effectiveness of traditional heuristics. To address these challenges, a Multi-Scale Radio Map-Guided (MRMG) framework is proposed. The MRMG framework handles heterogeneous dynamics by integrating three distinct levels of radio information: global-level maps provide regional coverage insights, local-level maps capture neighborhood-scale service conditions, and link-level maps characterize high-resolution channel features. This design effectively decouples macro-movement from micro-link adaptation. To yield long-term performance improvements, a multi-agent reinforcement learning (MARL) controller learns cooperative policies for UAV movement, next-hop selection, and transmit-power control. Simulation results show that the MRMG framework not only improves network throughput but also significantly bolsters cell-edge service, nearly doubling the 5th-percentile user rate.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Variable-Length Finite-Rate CSI Feedback With Generative Priors</title>
  <link>https://arxiv.org/abs/2606.06846</link>
  <pubDate>Mon, 08 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.06846v1 Announce Type: new Abstract: This letter studies variable-length finite-rate CSI feedback from a structural perspective and proposes CsiCoGen, a novel generative feedback structure with a transferable codebook mechanism without joint training. The UE maps $H_0$ into an ordered sequence of codebook indices, while the BS recursively recovers CSI from any received partial sequence of feedback indices using a shared denoising prior. This enables flexible control of feedback sequence length and per-step quantization precision through codebook size. CsiCoGen does not require jointly training a task-specific feedback encoder or codebook with the reconstructor, and the same online structure can be paired with different pretrained denoisers. In this work, we instantiate the decoder with a generative diffusion model. Simulation results on COST2100 show favorable rate-NMSE and rate-$\rho$ tradeoffs against representative baselines, with CsiCoGen reaching about -31 dB indoor NMSE and -20 dB outdoor NMSE in the high-rate regime while demonstrating scalable decoding complexity and adjustable per-step quantization precision.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Copula Function Parameter Regions in Analyzing Wireless Communications Performances</title>
  <link>https://arxiv.org/abs/2606.06792</link>
  <pubDate>Mon, 08 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.06792v1 Announce Type: new Abstract: Copula functions have been widely employed in wireless communication analysis to model dependence structures and evaluate system performance. However, existing studies generally express performance metrics in terms of copula dependence parameters without explicitly characterizing their admissible regions. This letter introduces the concept of copula dependence parameter regions and investigates its significance in wireless communications. Considering a two-user wireless multiple access channel (MAC) with correlated Rayleigh fading modeled by the bivariate Farlie--Gumbel--Morgenstern (FGM) copula, explicit parameter regions are derived from communication-theoretic and probabilistic perspectives using outage probability and Pearson correlation coefficient (PCC) constraints. The results show that practical communication and statistical requirements can significantly shrink the classical copula admissible interval, rendering some theoretically admissible dependence structures infeasible. Numerical examples illustrate the proposed concept and its practical implications.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Angular Sector-Based Sparse Array Design for Adaptive Beamforming Using Deep Learning</title>
  <link>https://arxiv.org/abs/2606.06732</link>
  <pubDate>Mon, 08 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.06732v1 Announce Type: new Abstract: Efficient sparse array reconfigurability is essential for cognitive sensing in dynamic radio frequency environments, where rapid interference variations require both adaptability and stability. This work presents a framework for designing sparse arrays optimized over broad angular sectors, enabling near-optimal beamforming that maximizes the signal-to-interference-plus-noise ratio (SINR) across a range of interferer angles. Full data correlation matrices are computed for candidate configurations, and an angular-sector-based class reduction strategy is applied to merge adjacent sectors dominated by the same configuration, resulting in 56 representative classes. Controlled up- and down-sampling produce four dataset variants (high and low sample counts, balanced and unbalanced) to systematically evaluate the effects of dataset size and class distribution on neural network performance. A lightweight convolutional neural network (CNN) and a deeper ResNet 50 architecture are trained and evaluated using these datasets. Results demonstrate high classification accuracy, with ResNet 50 achieving up to 97.3%, while SINR deviations remain below 1% for most classes and below 5% even for challenging interference angles near broadside. The proposed approach enables robust sparse array selection, maintains strong SINR performance, reduces unnecessary reconfigurations, and provides an effective framework for real-time cognitive sensing and adaptive interference mitigation.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Deep Learning Based Sparse Array Design with Pre-Steering for Adaptive Beamforming</title>
  <link>https://arxiv.org/abs/2606.06723</link>
  <pubDate>Mon, 08 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.06723v1 Announce Type: new Abstract: This paper investigates the use of convolutional neural networks (CNNs) for learning sparse array configurations that achieve near-optimal beamforming under varying source and interference angles. Unlike conventional or convex optimization based algorithms, the proposed deep learning approach enables rapid reconfiguration of sparse arrays in highly dynamic propagation environments. The paper considers a single desired source and a single interference signal at arbitrary angles, analyzing scenarios with both fixed and varying desired source directions. To avoid retraining for each possible source angle, an array pre-steering strategy is introduced, whereby the network is trained only at broadside, while test inputs are pre-steered to align with the broadside direction. To account for practical imperfections, the effect of pre-steering errors is examined, and a robust error-augmented training is adopted. The approach systematically incorporates small, structured pre-steering perturbations during training, enabling the network to maintain high classification accuracy and maximize the signal-to-interference-plus-noise ratio (SINR) even under angular uncertainty. The results demonstrate that the proposed method achieves over 90% test accuracy across wide ranges of source and interference angles, highlighting its potential for real-time, robust sparse array configuration in dynamic environments.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Variational Bayes Estimation for Affine-Precoded Superimposed Pilots in Partially Connected Dual-Wideband Tera-Hertz MU-MIMO Systems</title>
  <link>https://arxiv.org/abs/2606.06672</link>
  <pubDate>Mon, 08 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.06672v1 Announce Type: new Abstract: This work conceives two affine precoding based system models, common precoding with joint channel estimation (CP-JCE) and user-specific precoding for decoupled channel estimation (USPDCE). Considering a dual-wideband-affected partially connected architecture, we rigorously model the terahertz (THz) multiple-input multiple-output (MIMO) channel for each subarray corresponding to each user by incorporating the absorption, reflection, and free-space losses. Next, to address the significant bandwidth overhead associated with conventional pilot-based channel estimation, we employ superimposed pilots. Building on this, we formulate a structured sparse channel model and develop a variational Bayesian inference algorithm that jointly estimates the channel coefficients and learns the underlying sparsity structure through hyperparameter inference, thereby enabling robust and high-precision superimposed pilot-based channel estimation under severe model uncertainty. Lastly, we compare our results for both systems and provide a trade-off analysis between them.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Semantic Forwarding and Codebook-Enhanced Model Division Multiple Access for Satellite-Terrestrial Networks</title>
  <link>https://arxiv.org/abs/2603.02536</link>
  <pubDate>Mon, 08 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2603.02536v2 Announce Type: replace-cross Abstract: Satellite-terrestrial communications are severely constrained by high path loss, limited spectrum resources, and time-varying channel conditions, rendering conventional bit-level transmission schemes inefficient and fragile, particularly in low signal-to-noise ratio (SNR) regimes. Semantic communication has emerged as a promising paradigm to address these challenges by prioritizing task-relevant information over exact bit recovery. In this paper, we propose a semantic forwarding-based semantic communication (SFSC) framework optimized for satellite-terrestrial networks. Specifically, we develop a vector-quantized joint semantic coding and modulation scheme, in which the semantic encoder and semantic codebook are jointly optimized to shape the constellation symbol distribution, improving channel adaptability and semantic compression efficiency. To mitigate noise accumulation and reduce on-board computational burden, we introduce a satellite semantic forwarding mechanism, enabling relay satellites to forward signals directly at the semantic level without full decoding and re-encoding. Furthermore, we design a channel-aware semantic reconstruction scheme based on feature-wise linear modulation (FiLM) to fuse the received SNR with semantic features, enhancing robustness under dynamic channel conditions. To support multi-user access, we further propose a codebook split-enhanced model division multiple access (CS-MDMA) method to improve spectral efficiency. Simulation results show that the proposed SFSC framework achieves a peak signal-to-noise ratio (PSNR) gain of approximately 7.9 dB over existing benchmarks in the low-SNR regime, demonstrating its effectiveness for robust and spectrum-efficient semantic transmission in satellite-terrestrial networks.</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>Unregistered Spectral Image Fusion: Unmixing, Adversarial Learning, and Recoverability</title>
  <link>https://arxiv.org/abs/2603.21510</link>
  <pubDate>Mon, 08 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2603.21510v3 Announce Type: replace Abstract: This paper addresses the fusion of a pair of spatially unregistered hyperspectral image (HSI) and multispectral image (MSI) covering roughly overlapping regions. HSIs offer high spectral but low spatial resolution, while MSIs provide the opposite. The goal is to integrate their complementary information to enhance both HSI spatial resolution and MSI spectral resolution. While hyperspectral-multispectral fusion (HMF) has been widely studied, the unregistered setting remains challenging. Many existing methods focus solely on MSI super-resolution, leaving HSI unchanged. Supervised deep learning approaches were proposed for HSI super-resolution, but rely on accurate training data, which is often unavailable. Moreover, theoretical analyses largely address the co-registered case, leaving unregistered HMF poorly understood. In this work, an unsupervised framework is proposed to simultaneously super-resolve both MSI and HSI. The method integrates coupled spectral unmixing for MSI super-resolution with latent-space adversarial learning for HSI super-resolution. Theoretical guarantees on the recoverability of the super-resolution MSI and HSI are established under reasonable generative models -- providing, to our best knowledge, the first such insights for unregistered HMF. The approach is validated on semi-real and real HSI-MSI pairs across diverse conditions.</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>VIRTUS-FPP: Virtual Sensor Modeling for Fringe Projection Profilometry in NVIDIA Isaac Sim</title>
  <link>https://arxiv.org/abs/2509.22685</link>
  <pubDate>Mon, 08 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2509.22685v2 Announce Type: replace Abstract: Fringe projection profilometry (FPP) is a high-precision structured-light sensing technique for 3D surface reconstruction, yet its practical deployment is often constrained by complex calibration procedures, sensitivity to environmental conditions, and the high cost of physical experimentation. At the same time, robotics research increasingly relies on simulation platforms such as NVIDIA Isaac Sim for scalable development and validation, but accurate virtual representations of optical metrology sensors such as FPP are not currently available. In this work, we present VIRTUS-FPP, the first end-to-end virtual sensor modeling framework for fringe projection profilometry implemented in NVIDIA Isaac Sim, enabling physically grounded simulation of the complete FPP pipeline, including structured light projection, image formation, calibration, and 3D reconstruction, without dependence on pre-calibrated physical systems. The framework leverages an inverse camera model for projector representation, ensuring geometric and photometric fidelity consistent with structured-light principles. By bridging optical metrology and robotics simulation, VIRTUS-FPP enables high-fidelity synthetic data generation, systematic evaluation of sensing pipelines, and digital twin replication of real-world FPP systems. Experimental results demonstrate sub-millimeter reconstruction accuracy and strong correspondence between simulated and physical measurements, highlighting the framework&#39;s effectiveness and its potential to advance perception-driven robotics, simulation-to-reality transfer, and scalable optical sensor design.</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>Unsupervised Learning Based Focal Stack Camera Depth Estimation</title>
  <link>https://arxiv.org/abs/2203.07904</link>
  <pubDate>Mon, 08 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2203.07904v3 Announce Type: replace Abstract: We propose an unsupervised deep learning based method to estimate depth from focal stack camera images. On the NYU-v2 dataset, our method achieves much better depth estimation accuracy compared to single-image based methods.</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>EvoGS: Constructing Continuous-Layered Gaussian Splatting with Evolution Tree for Scalable 3D Streaming</title>
  <link>https://arxiv.org/abs/2606.07179</link>
  <pubDate>Mon, 08 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.07179v1 Announce Type: cross Abstract: Streaming 3D Gaussian Splatting requires highly scalable, progressive representations. Existing progressive methods rely on discrete layering, accumulating separate splat sets for each level of detail. This structural independence between layers inherently leads to error accumulation, severe splat redundancy, and uncontrolled quality transitions. We propose EvoGS, the first continuous-layering representation. Organized as an Evolution Tree, EvoGS generates finer details via an explicit, wavelet-inspired parent-child refinement. This empowers child nodes to structurally correct ancestral errors, yielding inherently sparse and highly compressible inter-layer signals. Extensive experiments show EvoGS reduces splat redundancy from over 65% to under 25%. Compared to state-of-the-art baselines, it reduces transmission payload and GPU VRAM footprint by up to 2.4$\times$ and 5.5$\times$, respectively, and achieves smooth quality transitions optimal for real-time adaptive streaming. Project page: https://yuang-ian.github.io/evogs/</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>DSU-Net: An Attention-Enhanced Dense Skip U-Net for Breast Lesion Segmentation in Mammographic Images</title>
  <link>https://arxiv.org/abs/2606.06537</link>
  <pubDate>Mon, 08 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.06537v1 Announce Type: cross Abstract: Breast cancer remains one of the leading causes of cancer-related mortality among women worldwide, making early detection essential for effective treatment. Mammography is the primary screening modality; however, accurate delineation of suspicious lesions remains challenging and subject to inter-observer variability. Automated segmentation methods can assist radiologists by providing consistent and efficient lesion localization. This study presents DSU-Net, an attention-enhanced Dense Skip U-Net architecture for automated breast lesion segmentation in mammographic images. The proposed framework integrates dense skip connections and attention mechanisms to improve feature propagation, preserve spatial information, and enhance lesion boundary delineation. Experiments were conducted using the Curated Breast Imaging Subset of the Digital Database for Screening Mammography (CBIS-DDSM). To address severe foreground-background imbalance, a composite loss function combining Dice loss, focal loss, and binary cross-entropy loss was employed during training. The proposed model achieved a Dice Similarity Coefficient of 0.9421, an Intersection over Union of 0.8905, an accuracy of 0.9711, and an AUC-ROC of 0.9878 on the validation dataset. Qualitative evaluation demonstrated accurate delineation of lesions with varying sizes and morphologies, while quantitative results confirmed robust discrimination between lesion and background regions. These findings demonstrate that DSU-Net provides accurate and reliable breast lesion segmentation in mammographic images and highlight the potential of attention-guided deep learning for computer-aided breast cancer screening and diagnosis.</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>Deployed trusted-node quantum key distribution over 300 km with a multi-core fiber access link</title>
  <link>https://arxiv.org/abs/2606.06107</link>
  <pubDate>Mon, 08 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.06107v1 Announce Type: cross Abstract: Quantum key distribution (QKD) is increasingly considered for deployment in realistic communication networks, where long distances, heterogeneous fiber infrastructure, and coexistence with classical traffic present substantial challenges. Here, we demonstrate trusted-node QKD between Linköping University and the Stockholm hub of the Swedish national quantum communication infrastructure over 270 km of deployed single-mode fiber, extended by a 33 km multi-core fiber (MCF) segment emulating a metropolitan access link, for a total distance of 303 km. The two sub-links use commercial QKD systems whose receivers are interfaced with external superconducting nanowire single-photon detectors, enabling operation at losses beyond those supported by standard internal gated-mode detectors. We operate the link while actively switching the QKD channel between two MCF cores, with co-propagating Ethernet traffic and injected broadband optical noise in the other cores. The results demonstrate the integration of commercial QKD into demanding, dynamically reconfigurable fiber infrastructure relevant to future hybrid quantum-classical networks. Finally, using the generated secret keys, we illustrate how limited and time-varying QKD throughput affects one-time-pad-protected image transmission: image fidelity depends strongly on the available QKD-generated key budget and the choice of compression algorithm, highlighting application-level challenges for QKD-based encryption in realistic scenarios.</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>Impact of Synthetic Lesional MR Images in Automated Focal Cortical Dysplasia Detection in Low-Data Scenarios</title>
  <link>https://arxiv.org/abs/2606.07381</link>
  <pubDate>Mon, 08 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.07381v1 Announce Type: new Abstract: Background and Purpose: Automated detection of focal cortical dysplasia (FCD) requires large volumes of voxelwise lesion-delineated MRI data, which are difficult to acquire. This study aims to generate synthetic MRI data exhibiting FCD, assess their realism, and evaluate their impact on automated FCD detection, particularly in reducing the need for manual annotations. Methods: T1-weighted (T1w) and T2-weighted Fluid-Attenuated Inversion Recovery (FLAIR) MRI scans from 131 FCD patients and 90 healthy controls from multiple (3) sites were retrospectively studied. Synthetic MRIs were generated by conditioning a generative network on binary FCD masks. Two neuroradiologists identified real images from a random set of 14 real and 14 synthetic scans. Three nnU-Net models were trained to detect FCD using: (i) real-only (35 FCD / 35 controls), (ii) real (35 FCD / 35 controls) plus synthetic augmentation, and (iii) expanded real data (70 FCD / 70 controls). Results: Experts showed limited ability to distinguish real from synthetic images, with classification accuracy of 60% for T1w and 70% for FLAIR (inter-rater agreement kappa = 0.86). Augmenting automated FCD detection with synthetic data increased sensitivity by 8.14% (p = 0.12) and improved model confidence at true lesion sites (0.83 +/- 0.11 to 0.89 +/- 0.12; p = 0.02). The expanded real-data model further improved sensitivity to 73.8% (p &lt; 0.001) and confidence to 0.90 +/- 0.14 (p = 0.01). Conclusion: Conditional generative networks can generate realistic synthetic FCD-MRIs, reducing labeled data needs by approximately 20% while maintaining equivalent sensitivity. Equivalent amounts of real data, when available, remain more effective than synthetic augmentation.</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>Beyond Universality: The GCC-FER Dataset and Culture-Aware Adaptation for Dynamic Facial Expression Recognition</title>
  <link>https://arxiv.org/abs/2606.07063</link>
  <pubDate>Mon, 08 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.07063v1 Announce Type: new Abstract: Dynamic Facial Expression Recognition (DFER) is a key enabling technology in affective computing, human-computer interaction, and intelligent multimedia systems. Despite the significant influence of cultural nuances on FER performance, most existing FER systems assume that emotional expressions are universally consistent across populations. This variation can be attributed to systematic differences in facial muscle activation patterns across cultures. A major challenge in advancing cross-cultural FER lies in the scarcity of culturally diverse benchmark datasets. To address this, a new hybrid multicultural video dataset termed Global Cross-Cultural Facial Expression Recognition (GCC-FER) is introduced. GCC-FER comprises 23,934 video samples spanning four cultural groups (African, Caucasian, East Asian, and South Asian) across seven basic expressions, combining psychologically supervised in-house data collection for underrepresented populations with rigorous ethnicity filtering of existing sources. To the best of our knowledge, GCC-FER is the first large-scale global cross-cultural DFER dataset designed to address these demographic gaps. Leveraging this dataset, behaviorally grounded cultural priors are derived for each cultural group and a global prior for practical deployment. A Culture-Aware FER (CA-FER) system is proposed to mitigate cultural bias by adaptively recalibrating latent facial representations. Extensive experiments on GCC-FER and DFEW demonstrate that the proposed system consistently improves FER performance across multicultural settings.</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>DaX: Learning General Pathology Representations Across Scales</title>
  <link>https://arxiv.org/abs/2606.06983</link>
  <pubDate>Mon, 08 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.06983v1 Announce Type: new Abstract: Computational pathology requires visual representations that transfer across diverse clinical endpoints and remain robust to variation in magnification, staining, scanner type, slide preparation, and input resolution. We present DaX, a pathology vision foundation model that adapts DINOv3-style self-supervised learning to whole-slide histopathology. DaX is initialized from natural-image DINOv3 weights and incorporates continuous magnification training, cross-scale tissue views, orientation-agnostic and acquisition-robust augmentation, multi-input-size training, and Gram-anchored dense consistency. These designs aim to connect local cellular morphology with global tissue architecture while stabilizing dense token-level representations across input scales. We further construct a WSI-level benchmark comprising 161 clinically meaningful tasks from 44 public datasets, covering 28,182 patients and 34,394 slides across four clinical domains and nine task categories. All models are evaluated under a fixed patient-level cross-validation protocol with fold-level statistical ranking, enabling reproducible comparisons that are less sensitive to split-dependent variation. Across this benchmark, DaX achieves the highest mean performance across tasks and consistently strong task-level ranking scores, with gains spanning diagnostic pathology, biomarker and molecular profiling, tissue/specimen context, and risk, response, and prognosis. These results support DaX as a transferable visual encoder for computational pathology and provide a standardized evaluation framework for future pathology foundation models. Project page: https://alibaba-damo-academy.github.io/DaX/benchboard/.</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>A 3D Formulation of the Extended Phaseless Rytov Approximation</title>
  <link>https://arxiv.org/abs/2606.06933</link>
  <pubDate>Mon, 08 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.06933v1 Announce Type: new Abstract: The extended Phaseless Rytov Approximation (xPRA) is a recently proposed device-free RF imaging technique that provides high-resolution reconstructions of the imaging region using only phaseless measurements, such as received signal strength (RSS). Because of its phaseless formulation, it can be implemented straightforwardly using existing wireless communication infrastructure. It also outperforms well-known device-free phaseless RF imaging methods such as Radio Tomographic Imaging (RTI). The linear phaseless formulation used in xPRA (and RTI) makes these methods potentially useful for integrated sensing and communication (ISAC) systems in next-generation wireless networks since they do not require wide bandwidths. However, so far, both xPRA and RTI have primarily been formulated in two dimensions (2D). This paper introduces a 3D extension of xPRA, which we call the extended three-dimensional phaseless Rytov approximation (x3DPRA). The novelty of our approach is that it preserves the straightforward implementation advantages of RTI and xPRA while enabling volumetric (3D) imaging. Simulation results show that x3DPRA provides good estimates of location and shape and can also reconstruct object material attenuation. We present the 3D formulation, validate it with a 2D model comparison, and report simulation results demonstrating its performance.</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>Physics-Driven Semantic Scattering Structure Understanding of Aircraft Target in SAR Images</title>
  <link>https://arxiv.org/abs/2606.06847</link>
  <pubDate>Mon, 08 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.06847v1 Announce Type: new Abstract: Synthetic aperture radar (SAR) has become indispensable for target interpretation owing to its all-day and all-weather observation capability. In SAR target interpretation, electromagnetic scattering information provides a physically grounded cue beyond visual texture and has been widely exploited for target interpretation. However, existing methods remain dominated by local scattering center representations. Such unordered and component-agnostic representations are highly unstable for aircraft targets. As a result, physically existing components with weak scattering responses are often missed, resulting in an incomplete reconstructed topology structure. To address this limitation, we establish Semantic Scattering Structure Understanding as a new paradigm for SAR aircraft interpretation. Semantic scattering keypoints are defined to associate local electromagnetic responses with physically meaningful aircraft components, while visibility-aware attributes are introduced to retain weakly observable yet physically existing components. The keypoints are further organized into a stable semantic scattering structure. Building upon this, we propose S3U-SAR, a physics-driven framework to localize semantic scattering keypoints and construct the complete representation constrained by multi-dimensional physical priors, including scattering heterogeneity, rigid-body topology, and speckle uncertainty. A confidence-gated joint supervision strategy is further introduced to alleviate optimization conflicts. We construct KP-SAR-Aircraft-1.0, the first fine-grained benchmark for semantic scattering structure understanding. Extensive experiments demonstrate that S3U-SAR achieves the best performance compared with baselines. Cross-category and cross-dataset evaluations further verify its robustness and transferability.</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>Compute-Optimal Network Design for Echocardiography Myocardial Segmentation and Perfusion Quantification using Neural Scaling Laws</title>
  <link>https://arxiv.org/abs/2606.06725</link>
  <pubDate>Mon, 08 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.06725v1 Announce Type: new Abstract: Myocardial perfusion quantification using contrast-enhanced ultrasound offers a bedside non-ionizing alternative to nuclear imaging modalities. However, its clinical adoption is hindered by time-consuming manual labelling. Automated segmentation has proved challenging due to a paucity of in-domain training data. Adapting strategies currently used to optimise large language models for large datasets, we apply neural scaling laws to predict network performance for myocardial segmentation. We extrapolate performance on subsets of the data to determine optimal network size on the CAMUS echocardiography dataset and a 25-patient contrast-enhanced ultrasound (CEUS) dataset. Finally, we validate the clinical utility of our models by comparing the final myocardial perfusion parameters with those obtained by a senior cardiologist. Extrapolation based on the scaling law is predictive of test loss at the full dataset size, allowing us to select two networks that obtained state-of-the-art performance on CAMUS with a 240-fold reduction in parameter count. We observe the gradient of the scaling law transfers from CAMUS to the CEUS dataset with a bias in the predicted losses. The automatically segmented masks perform equivalently to a senior cardiologist in myocardial perfusion quantification. These results establish neural scaling laws as a practical tool for data-driven compute-optimal model design for small imaging datasets.</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>ErA: Error-Aware Deep Unrolling Network for Single Image Defocus Deblurring</title>
  <link>https://arxiv.org/abs/2606.06540</link>
  <pubDate>Mon, 08 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.06540v1 Announce Type: new Abstract: We introduce ErA (Error-Aware Deep Unrolling Network), an end-to-end framework for single-image defocus deblurring. ErA jointly learns a compact kernel basis and per-pixel weights, while an error-aware term in Augmented Lagrangian unrolling corrects kernel estimation errors via alternating updates and ResUNet denoisers. It achieves state-of-the-art PSNR/SSIM on DPDD, RealDOF, and RTF, and shows strong generalization on CUHK without ground truth.</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>Attention Consistent Longitudinal Medical Visual Question Answering Guided by Vision Foundation Models</title>
  <link>https://arxiv.org/abs/2606.06534</link>
  <pubDate>Mon, 08 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.06534v1 Announce Type: new Abstract: Longitudinal medical visual question answering (VQA) requires reasoning about anatomical differences between an image of a current time point and an image of a referred time point. We propose an attention-guided encoder-decoder for this task with chest X-rays. Instead of conventional direct contrast, we propose to include a lightweight affine registration module to reduce nuisance motion by co-registering the current image to the reference image with a small registration regularizer. The registered image pair is fed into the image encoder, followed by a frozen DINO-based mask generator and a trainable adaptive mask generator to produce masks applied to the original image pairs. The masked image pairs are again fed into the image encoder and concatenated with text features as the input to a multimodal transformer-based decoder to generate final answers. To facilitate learning stabilization and clarify the change signal, inspired by DINO-v3, we include additional auxiliary objectives, including a mask rebuilding loss, a pairwise Gram-style consistency loss, and a KoLeo uniformity loss, which enhances the geometry of the representation. On the Medical-Diff-VQA benchmark, the model delivers strong BLEU, ROUGE-L, CIDEr, and METEOR scores while offering intrinsic interpretability through the shared saliency mask. These results support saliency-conditioned generation with mild pre-alignment as a principled framework for longitudinal reasoning in medical VQA. Our training strategy also illustrates the potential of a paradigm in utilizing image foundation models in biomedicine: optimizing both supervised and unsupervised learning objectives simultaneously.</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>Advanced Flood Prediction with Physics-Guided Deep Learning: Combining UNet, FNO, and SAR/Optical Imagery</title>
  <link>https://arxiv.org/abs/2606.06524</link>
  <pubDate>Mon, 08 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.06524v1 Announce Type: new Abstract: Accurate and scalable flood mapping remains challenging due to limited ground observations, heterogeneous terrain conditions, and the difficulty of enforcing hydrodynamic consistency within data-driven models. This work introduces a physics-guided deep learning framework that integrates multi-modal remote sensing (Sentinel-1 SAR, Sentinel-2 optical imagery, and DEM-derived terrain features) with constraints from the depth-averaged shallow water equations (SWE). The proposed hybrid architecture combines a UNet to capture fine-scale spatial details with a Fourier Neural Operator (FNO) to model basin-scale hydraulic interactions, while physics-informed residual losses ensure mass and momentum consistency. Evaluated across diverse floodplain settings, the hybrid model achieves an Intersection over Union of 0.82 and an F1 score of 0.90 for flood extent prediction, outperforming UNet-only and FNO-only baselines. Using hydrodynamic simulations as reference data, the model achieves an RMSE of 0.21 m for water depth and 0.15 m/s for flow velocity. Physics consistency is maintained, with low residuals and mass imbalance below 2.1%. Ablation studies confirm that removing physics-based regularization significantly degrades performance, underscoring the value of physical constraints for stability and generalization. These results demonstrate that embedding hydrodynamic principles into deep learning yields more accurate, reliable, and physically coherent flood predictions, offering strong potential for operational monitoring and large-scale deployment.</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>Dilated Symmetric Difference for Binary Image Comparison</title>
  <link>https://arxiv.org/abs/2606.06512</link>
  <pubDate>Mon, 08 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.06512v1 Announce Type: new Abstract: The comparison of two binary images is formulated in terms of mathematical morphology. A new operator, the dilated symmetric difference, is introduced. It is shown that the dilated symmetric difference effectively detects differences between binary images, provided that the residual alignment error is within specified bounds.</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>Multi-task Learning is Not Enough: Representational Entanglement in Dual-output Second Language Speech Recognition</title>
  <link>https://arxiv.org/abs/2606.06065</link>
  <pubDate>Mon, 08 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.06065v2 Announce Type: replace-cross Abstract: Second-language (L2) speech recognition often requires transcriptions of pronunciations and intended meanings. Multi-task learning (MTL) is a natural approach because it assumes that shared representations benefit both outputs. However, this paper shows that this assumption does not hold across Korean and English. MTL improves meaning but degrades surface transcription, especially in English, where the degradation scales with surface-meaning divergence measured by Levenshtein edit distance. Encoder analysis links these patterns to encoder-level entanglement, with Korean preserving distinct task representations while English produces nearly identical ones. Cross-task decoder analysis shows that the meaning dual-output decoder adapts with a unique representation, while the surface dual-output decoder remains constrained by the encoder. These findings motivate the design of MTL frameworks that mitigate encoder-level entanglement to reduce surface degradation in dual-output L2 automatic speech recognition.</description>
  <dc:source>Systems/eess.AS_(Audio_and_Speech_Processing)</dc:source>
</item>
<item>
  <title>Do speech foundation models perceive speaker similarity as humans do?</title>
  <link>https://arxiv.org/abs/2606.05739</link>
  <pubDate>Mon, 08 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.05739v2 Announce Type: replace-cross Abstract: This study presents a comparative analysis between the speaker embeddings of speech foundation models and human subjective perception of speaker similarity. Human listeners can judge speaker similarity on a continuous scale, discerning how similar two voices are. In contrast, speech foundation models embed speaker characteristics into numerical representations. However, a question remains: does the numerical distance between speaker embeddings in these models truly align with the similarity perceived by humans? To address this, we conduct a comprehensive investigation using more than 40 models to compare model-derived distances with human-perceived similarity scores. Furthermore, we identify which factors in model configuration contribute most to a speaker embedding that mirrors human perception. Our findings provide insights for the development of more perceptually grounded speech foundation models.</description>
  <dc:source>Systems/eess.AS_(Audio_and_Speech_Processing)</dc:source>
</item>
<item>
  <title>Benchmarking Language Modeling for Lossless Compression of Full-Fidelity Audio</title>
  <link>https://arxiv.org/abs/2603.08683</link>
  <pubDate>Mon, 08 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2603.08683v2 Announce Type: replace-cross Abstract: Autoregressive &quot;language&quot; models (LMs) trained on raw waveforms can be repurposed for lossless audio compression, but prior work is limited to 8-bit audio, leaving open whether such approaches work for practical settings (16/24-bit) and can compete with existing codecs. We benchmark LM-based compression on full-fidelity audio across diverse domains (music, speech, bioacoustics), sampling rates (16kHz-48kHz), and bit depths (8, 16, 24-bit). Standard sample-level tokenization becomes intractable at higher bit depths due to vocabulary size (65K for 16-bit; 16.7M for 24-bit). We propose Trilobyte, a byte-level tokenization schema for full resolution audio, improving vocabulary scaling from $O(2^{b})$ to $O(1)$ and enabling the first tractable 24-bit LM-based lossless compression. While LMs consistently outperform FLAC and yield state-of-the-art compression at 8-bit and 16-bit, we observe that compression gains become more modest as bit depth increases beyond 8-bit.</description>
  <dc:source>Systems/eess.AS_(Audio_and_Speech_Processing)</dc:source>
</item>
<item>
  <title>M2S-AVSR: Modality-aware Multi-view Self-supervised Representation for Robust Audio-Visual Speech Recognition</title>
  <link>https://arxiv.org/abs/2606.05763</link>
  <pubDate>Mon, 08 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.05763v2 Announce Type: replace Abstract: Audio-Visual Speech Recognition (AVSR) enhances speech recognition robustness by leveraging visual cues, while real-world scenarios remain challenging due to viewpoint variation, audio distortion, and visual occlusion, which degrade modality quality and increase audio-visual asynchrony. In this paper, we propose a novel Modality-aware Multi-view Self-supervised representation framework for robust Audio-Visual Speech Recognition (M2S-AVSR). First, we introduce a multi-view representation learning encoder to learn view-invariant visual speech representations. Next, we employ a modality-aware module that explicitly models modality quality and cross-modal synchrony to perform fine-grained modality-aware fusion, enabling fine-grained visual information injection during decoding. In addition, we release AISHELL8-RealScene, a public multi-scenario, multi-view conversational audio-visual dataset recorded in real-world environments, and establish a speech recognition benchmark on it. Experiments on English and Mandarin benchmarks demonstrate the effectiveness of the proposed method under challenging conditions. On LRS3, M2S-AVSR achieves up to 29.4% relative improvement under viewpoint perturbation and visual degradation settings. Our method also achieves new state-of-the-art performance on the MISP2021-AVSR test set. On AISHELL8-RealScene, it achieves the best result in outdoor scenes. The proposed method and dataset provide useful support for future research on robust speech and multimodal tasks under realistic conditions.</description>
  <dc:source>Systems/eess.AS_(Audio_and_Speech_Processing)</dc:source>
</item>
<item>
  <title>Mitigating Proxy-to-Wild Domain Gap in Deepfake Speech</title>
  <link>https://arxiv.org/abs/2606.07494</link>
  <pubDate>Mon, 08 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.07494v1 Announce Type: cross Abstract: Recent neural audio codec-based speech generation (CodecFake) produces highly realistic audio, posing a challenge to existing deepfake countermeasure models. While using codec resynthesized speech (CoRS) as proxy data improves performance, it often suffers from limited generalization. We propose Domain-Shift Feature Augmentation (DSFA), which simulates &quot;in-the-wild&quot; variations by transforming deterministic feature statistics into stochastic distributions during fine-tuning. To evaluate generalization, we further introduce the Codec-based Speech Generation Extension Evaluation (CoSG ExtEval) dataset, a more challenging extension of the CoSG Eval (from CodecFake+) dataset, featuring 40 unseen generative models and long-form audio. Experimental results demonstrate that combining a post-trained SSL backbone with DSFA effectively narrows the proxy-to-wild domain gap. This approach achieves state-of-the-art performance across diverse CodecFake attacks in both CoSG Eval and CoSG ExtEval.</description>
  <dc:source>Systems/eess.AS_(Audio_and_Speech_Processing)</dc:source>
</item>
<item>
  <title>Contrastive Training with LLM-generated Near-Misses for Robust Code-Switching Speech Recognition</title>
  <link>https://arxiv.org/abs/2606.06985</link>
  <pubDate>Mon, 08 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.06985v1 Announce Type: cross Abstract: Code-switching (CS), the alternation between multiple languages within a single utterance, remains challenging for Automatic Speech Recognition (ASR). To address this issue, we propose a Point-of-Interest (POI)-aware contrastive training framework that improves recognition at CS-critical regions. We first identify CS spans by adopting a POI detection method from the literature, then construct acoustically plausible near-miss hypotheses by perturbing POIs in ASR N-best outputs and expanding candidates with a large language model. Hard but plausible negatives are retained through filtering with acoustic, phonemic, and textual constraints. Finally, we fine-tune Whisper-small with LoRA using a POI-weighted cross-entropy anchor objective together with a multi-negative contrastive ranking loss. Experiments on CS-FLEURS (cmn-eng) and ViMedCSS (vie-eng) show consistent reductions of over 2% in both general and CS-aware error rates compared to standard LoRA fine-tuning.</description>
  <dc:source>Systems/eess.AS_(Audio_and_Speech_Processing)</dc:source>
</item>
<item>
  <title>VISA: A Visual Information Strengthened Audio-Reasoning System for the Interspeech 2026 ARC Agent Track</title>
  <link>https://arxiv.org/abs/2606.07264</link>
  <pubDate>Mon, 08 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.07264v1 Announce Type: new Abstract: Audio reasoning requires multi-step, evidence-grounded inference over temporally dynamic and acoustically mixed signals, exceeding conventional perception tasks such as ASR or captioning. We present VISA, our submission to the Interspeech 2026 Audio Reasoning Challenge (Agent Track), evaluated via the MMAR Rubrics for correctness and reasoning quality. Under a &quot;LALM as a Tool&quot; paradigm, VISA strengthens large audio language models with auxiliary multi-modal evidence while avoiding heavy orchestration. The system integrates three components: multi-modal feature extraction for complementary audio and acoustic-visual clues, model-voting inference with consistency checking for stable predictions, and fine-grained category-aware routing to resolve disagreements and select rubric-aligned reasoning chains. On the official Agent Track leaderboard, VISA ranks 2nd overall with a 66.23% Rubrics score. It also achieves 77.40% Accuracy, the highest among all systems listed across both the Single Model and Agent tracks.</description>
  <dc:source>Systems/eess.AS_(Audio_and_Speech_Processing)</dc:source>
</item>
<item>
  <title>Assessing True Generalisability of Audio-Visual Speech Recognisers</title>
  <link>https://arxiv.org/abs/2606.07259</link>
  <pubDate>Mon, 08 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.07259v1 Announce Type: new Abstract: Current Audio-Visual Speech Recognition (AVSR) models achieve near-perfect performance on the standard LRS3 benchmark, raising concerns of adaptive overfitting. To systematically assess true generalisability, we construct a highly controlled, unseen evaluation set subsampled from the massive MultiVSR dataset. Unlike standard out-of-distribution benchmarks, our subset strictly matches the acoustic, visual, and demographic distributions of the LRS3 test set. Evaluating five state-of-the-art architectures reveals a universal performance collapse, proving that current systems fail to generalise even under strictly aligned conditions. Through a fine-grained attribute analysis across seven factors, we isolate the specific drivers of this degradation. Furthermore, we uncover a profound lexical bias, expose distinct error patterns, and surprisingly reveal that audio-visual performance even lags behind audio-only settings. We release our matched test set for future benchmarking.</description>
  <dc:source>Systems/eess.AS_(Audio_and_Speech_Processing)</dc:source>
</item>
<item>
  <title>Audio Imitator: Controlling Timbre and Tempo in Video2Audio Synthesis with Audio Reference</title>
  <link>https://arxiv.org/abs/2606.07182</link>
  <pubDate>Mon, 08 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.07182v1 Announce Type: new Abstract: Video-to-audio generation has made significant progress in achieving semantic consistency and temporal alignment from silent videos. However, audio contains rich stylistic attributes such as timbre and tempo that are difficult to infer from visual and textual inputs alone. While reference audio can serve as additional conditioning, it is typically treated as a holistic signal, limiting fine-grained style control. We propose AudioIM, an attribute-aware framework that explicitly models timbre and tempo as separate control factors rather than relying on holistic prompt conditioning. Dual encoders extract complementary timbre-related and tempo-related representations, which are injected through global conditioning. A masking-based training strategy enables effective latent prompt conditioning at inference. Experiments on VGGSound show improved style similarity while preserving semantic alignment and synchronization. Audio samples are available at: https://anonymousdemo757.github.io/.</description>
  <dc:source>Systems/eess.AS_(Audio_and_Speech_Processing)</dc:source>
</item>
<item>
  <title>FSC-Net: Integrating Fast Fourier Convolutions and Progressive Learning for Speech Bandwidth Extension</title>
  <link>https://arxiv.org/abs/2606.06962</link>
  <pubDate>Mon, 08 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.06962v1 Announce Type: new Abstract: Speech bandwidth extension (BWE) aims to reconstruct high-fidelity wideband audio from narrowband inputs. While recent approaches have made significant progress, they often struggle to reconstruct realistic high-frequency phase and harmonic structures, leading to perceptual artifacts. In this paper, we propose FSC-Net (Full-Spectrum Context Network), a parameter-efficient architecture designed to explicitly model cross-band harmonic dependencies. By integrating Fast Fourier Convolutions (FFCs) into a complex spectral mapping framework, FSC-Net expands its receptive field to the entire spectrum, capturing long-range frequency interactions effectively. To address the ill-posed nature of high-frequency generation, our novel frequency-progressive learning curriculum guides the network to reconstruct spectral details from coarse to fine. Experimental results on the VCTK and unseen EARS datasets demonstrate that FSC-Net delivers consistently strong reconstruction quality and generalization, particularly in the challenging VCTK 4 kHz-to-48 kHz task. Compared to scaled-up baselines, our model attains leading LSD and PESQ scores while maintaining a highly compact parameter footprint (1.54 M).</description>
  <dc:source>Systems/eess.AS_(Audio_and_Speech_Processing)</dc:source>
</item>
<item>
  <title>Beyond Semantic Dominance: Cognitive Affective Reasoning and Empathetic Response Alignment in Audio Language Models</title>
  <link>https://arxiv.org/abs/2606.06940</link>
  <pubDate>Mon, 08 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.06940v1 Announce Type: new Abstract: While Audio Language Models (ALMs) demonstrate strong semantic understanding, they struggle with complex affective interactions. Specifically, textual semantic dominance often overshadows acoustic nuances, and a lack of cognitive depth leads to generic, emotion-agnostic responses. We propose CogAudio-LLM (https://github.com/zxzhao0/CogAudio-LLM), a novel cognitive affective reasoning framework. To mitigate semantic dominance, we build LIME-440K, a &quot;lexically-identical, multi-emotion&quot; dataset designed to facilitate acoustic-semantic decoupling. We introduce EIPS, a 4-step Chain-of-Thought (CoT) mechanism incorporating psychological reasoning. For inference efficiency, multi-stage training explicitly establishes EIPS via supervised fine-tuning, then distills this logic into an implicit generation process. Finally, we design DR-SAPO (Dual-Route Soft Adaptive Policy Optimization) to dynamically balance the logical rigor of the CoT with the empathetic quality of the direct response.</description>
  <dc:source>Systems/eess.AS_(Audio_and_Speech_Processing)</dc:source>
</item>
<item>
  <title>SpectCount: Spectrotemporal Counting via Synthetic Signals Improves Large Audio Language Models</title>
  <link>https://arxiv.org/abs/2606.06907</link>
  <pubDate>Mon, 08 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.06907v1 Announce Type: new Abstract: Large audio language models (LALMs) extend large language models with an audio encoder and large-scale audio data. However, the scarcity of high-quality annotated audio data remains a fundamental bottleneck for scaling. Through probing signal detectability analysis, we identify fine-grained spectrotemporal perceptual weaknesses in a foundation LALM. To address these challenges, we propose Spectrotemporal Counting (SpectCount), a data-efficient fine-tuning approach based on fully synthetic audio signals generated on-the-fly, without relying on real-world audio, annotations, or pretrained generative models. SpectCount not only resolves the observed weaknesses but also improves performance on diverse auditory benchmarks spanning sound, music, and speech, unseen during fine-tuning. These results suggest that weakness-targeted synthetic signals provide a data-efficient path toward enhanced auditory understanding capabilities in LALMs.</description>
  <dc:source>Systems/eess.AS_(Audio_and_Speech_Processing)</dc:source>
</item>
<item>
  <title>SEAM: Shortcut-Aware Real-Time Detection of Scripted vs. Spontaneous Speech for Interview Guardrails</title>
  <link>https://arxiv.org/abs/2606.06837</link>
  <pubDate>Mon, 08 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.06837v1 Announce Type: new Abstract: Scripted vs. spontaneous speech detection is appealing for interview guardrails, but benchmark performance can be inflated by shortcuts tied to corpus identity, channel conditions, and recording artifacts rather than speaking style itself. We present SEAM, a shortcut-aware framework for real-time scriptedness detection that combines uniform preprocessing, seam-aware sampling, non-speech augmentation, and a compact DistilHuBERT backbone. With 8 s windows, the model achieves 0.971 +/- 0.004 ROC-AUC on an external interview-domain evaluation set. Removing the shortcut-prevention components improves internal held-out metrics but sharply reduces external performance, indicating shortcut learning. Post-training quantization reduces the model footprint to 41.8 MB with little loss in external performance. The results demonstrate that robust real-time scriptedness detection depends not only on the backbone, but also on shortcut-aware data design and evaluation. We release code and model checkpoints.</description>
  <dc:source>Systems/eess.AS_(Audio_and_Speech_Processing)</dc:source>
</item>
<item>
  <title>Enhancing Audio Captioning with Auxiliary AudioSet Semantics</title>
  <link>https://arxiv.org/abs/2606.05717</link>
  <pubDate>Fri, 05 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.05717v1 Announce Type: new Abstract: Automatic Audio Captioning (AAC) seeks to generate natural language descriptions of complex acoustic scenes, bridging auditory perception and language understanding. However, word-selection indeterminacy and increasing reliance on large-scale sequence-to-sequence or LLM-based models limit practical deployment. We propose a resource-efficient AAC framework that explicitly grounds caption generation in auxiliary AudioSet semantics. Frame-level acoustic representations extracted using a ConvNeXt encoder are augmented with top-$K$ predicted AudioSet keywords, providing structured contextual cues for decoding. A compact six-layer BART-style decoder conditions on this joint acoustic-semantic representation, enabling caption generation without LLM-scale decoding. The proposed design balances semantic grounding and computational efficiency within a compact architecture. Evaluations on Clotho V2 and AudioCaps confirm competitive caption quality under practical deployment constraints.</description>
  <dc:source>Systems/eess.AS_(Audio_and_Speech_Processing)</dc:source>
</item>
<item>
  <title>An Ultra-Low-Bitrate Neural Speech Codec with Plain-to-Pseudo Synergistic Vector Quantization</title>
  <link>https://arxiv.org/abs/2606.05876</link>
  <pubDate>Fri, 05 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.05876v1 Announce Type: new Abstract: Most neural speech codecs use residual vector quantization (RVQ), in which later VQs contribute less but consume the same bitrate, leading to inefficiency. We propose P2PSynCodec, an ultra-low-bitrate neural speech codec with a plain-to-pseudo synergistic vector quantizer (P2PSVQ). P2PSVQ consists of one plain VQ and multiple pseudo VQs. The plain VQ produces basic tokens by quantization, while the pseudo VQs generate auxiliary tokens by neural prediction and incur zero transmitted bitrate. Thus, speech is decoded from the plain-VQ tokens together with predicted pseudo-VQ tokens, greatly reducing bitrate. Experiments show that P2PSynCodec achieves speech reconstruction quality comparable to competing codecs at 2.0 kbps while operating at only 0.5 kbps, demonstrating high efficiency for ultra-low-bitrate speech coding.</description>
  <dc:source>Systems/eess.AS_(Audio_and_Speech_Processing)</dc:source>
</item>
<item>
  <title>VoCodec: A Low-bitrate Streamable Neural Speech Codec with Voicing-driven Quantization</title>
  <link>https://arxiv.org/abs/2606.05892</link>
  <pubDate>Fri, 05 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.05892v1 Announce Type: new Abstract: Neural speech codecs are key to speech transmission and storage, but most use uniform quantization across frames, allocating the same bitrate regardless of content and wasting bits. We propose VoCodec, a low-bitrate streamable neural speech codec with voicing-driven quantization that assigns higher bitrate to voiced frames and lower bitrate to unvoiced frames according to perceptual sensitivity. VoCodec embeds a voicing detector in a fully causal encoder-quantizer-decoder neural coding framework, using residual scalar-vector quantization for voiced frames and simple scalar quantization for unvoiced ones. Experiments show that on the LibriTTS dataset at a 16 kHz sampling rate, VoCodec outperforms baseline neural speech codecs even at a bitrate as low as 1.1 kbps. Further experiments confirm that introducing voicing-driven quantization can reduce the bitrate by approximately 27% compared with a uniform quantization strategy.</description>
  <dc:source>Systems/eess.AS_(Audio_and_Speech_Processing)</dc:source>
</item>
<item>
  <title>CoSTA: Cognitive-State-Conditioned TTS Data Augmentation Using ASR Transcripts for Alzheimer&#39;s Disease Detection</title>
  <link>https://arxiv.org/abs/2606.06170</link>
  <pubDate>Fri, 05 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.06170v1 Announce Type: new Abstract: Speech-based Alzheimer&#39;s Disease (AD) detection is constrained by scarce pathological speech data. To address this, we propose CoSTA, a Text-to-Speech (TTS)-based data augmentation framework. Specifically, we first develop two Cognitive-State-Conditioned (CS-Cond) TTS models by adapting CosyVoice2 and F5-TTS to synthesize speech with distinct AD and Healthy Control characteristics. Furthermore, by constructing a transcript pool comprising Manual Transcripts (MT) and 36 Automatic Speech Recognition (ASR) transcripts, we investigate the impact of text sources on TTS-based augmentation. We also perform augmentation-factor analysis and test-time augmentation. Experiments on the ADReSS dataset show that CS-Cond TTS significantly improves synthetic speech utility, and ASR-driven augmentation frequently outperforms MT-driven augmentation. Finally, CoSTA yields a 4.16% gain over the baseline, achieving an audio-only accuracy of 85.83% on the ADReSS test set and outperforming prior methods.</description>
  <dc:source>Systems/eess.AS_(Audio_and_Speech_Processing)</dc:source>
</item>
<item>
  <title>Revisiting Lexicon Evaluation in Unsupervised Word Discovery</title>
  <link>https://arxiv.org/abs/2606.06183</link>
  <pubDate>Fri, 05 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.06183v1 Announce Type: new Abstract: Building a lexicon from discovered word-like units is a central goal in zero-resource speech processing. But do our evaluations provide a trustworthy indication of lexicon quality? A common metric, normalized edit distance, averages the phoneme edit distances between discovered units in each cluster. We show that this metric has an inherent bias toward the quality of large clusters, inhibiting fair evaluation. Moreover, it ignores how well true classes are distributed across clusters. Based on established theory in clustering literature, we propose two metrics that address these shortcomings: a modified metric that weighs cluster size when assessing within-cluster consistency, and an inverse metric that assesses how true words are spread across clusters. Through experiments on synthetic and real-world lexicons, we demonstrate that combined, these metrics are: (1) more closely correlated with how similar a lexicon is to the ground-truth distribution, and (2) more robust to biases that skew lexicon evaluations.</description>
  <dc:source>Systems/eess.AS_(Audio_and_Speech_Processing)</dc:source>
</item>
<item>
  <title>USAD 2.0: Scaling Representation Distillation for Universal Audio Understanding</title>
  <link>https://arxiv.org/abs/2606.06444</link>
  <pubDate>Fri, 05 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.06444v1 Announce Type: new Abstract: Audio encoders are critical to modern audio applications as large language models (LLMs) increasingly rely on a single encoder for diverse inputs. While self-supervised learning (SSL) has yielded strong domain-specific encoders like speech or music experts, multi-domain approaches like USAD and SPEAR remain limited in coverage and evaluation. Recent studies also suggest supervised encoders align better with audio LLMs. We present USAD 2.0, a universal encoder integrating knowledge from both SSL and supervised foundation models. USAD 2.0 introduces domain-aware distillation to address teacher mismatch, extends coverage to the music domain, and adds second-stage supervised distillation for downstream use. We further scale the model to one billion parameters via depth scaling. Experiments show USAD 2.0 achieves strong or state-of-the-art performance across probing and LLM-based evaluations.</description>
  <dc:source>Systems/eess.AS_(Audio_and_Speech_Processing)</dc:source>
</item>
<item>
  <title>MCBench: A Multicontext Safety Assessment Benchmark for Omni Large Language Models</title>
  <link>https://arxiv.org/abs/2606.05177</link>
  <pubDate>Fri, 05 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.05177v1 Announce Type: cross Abstract: Existing multimodal safety benchmarks focus solely on visual inputs and cannot assess Omni Large Language Models (LLMs) that process vision, audio, and text. We introduce MCBench, a benchmark with 1196 scenarios spanning four safety categories that require integrating multiple modalities for accurate safety assessment. Each unsafe scenario is paired with a minimally different safe counterpart to assess model sensitivity. Our evaluations of state-of-the-art models reveal significant challenges. Omni LLMs struggle with subtle or non-physical risks but perform better when salient visual or acoustic cues are present. Analysis of reasoning traces shows that, although models can extract modality-specific information, they often fail to integrate these cues effectively for safety judgments. Our findings reveal that current Omni LLMs lack robust cross-modal reasoning in safety-critical settings, underscoring the need for improved architectures and training strategies for multimodal safety.</description>
  <dc:source>Systems/eess.AS_(Audio_and_Speech_Processing)</dc:source>
</item>
<item>
  <title>FORTE: FOL-guided Optimal Refinement for Text-audio rEtrieval</title>
  <link>https://arxiv.org/abs/2606.05812</link>
  <pubDate>Fri, 05 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.05812v1 Announce Type: cross Abstract: Text-to-audio retrieval has made significant progress with shared embedding models such as CLAP and Pengi, yet they often struggle with fine-grained semantic alignment due to the inherent modality gap between text and audio. In this work, we propose FORTE, a unified framework that integrates structured logical reasoning with parameter-efficient cross-modal alignment to improve retrieval precision. Our approach first transforms queries into first-order logic and refines them via a constrained search that preserves semantic invariance while introducing discriminative attributes. The refined representation is then aligned with audio embeddings using a lightweight projection module, followed by a predicate-aware re-ranking step that enforces logical consistency at inference. Extensive experiments on AudioCaps and Clotho demonstrate consistent improvements over strong baselines, particularly in challenging fine-grained scenarios. Our results highlight the effectiveness of combining symbolic reasoning with representation learning for cross-modal retrieval.</description>
  <dc:source>Systems/eess.AS_(Audio_and_Speech_Processing)</dc:source>
</item>
<item>
  <title>Towards Truly Multilingual ASR: Generalizing Code-Switching ASR to Unseen Language Pairs</title>
  <link>https://arxiv.org/abs/2606.05846</link>
  <pubDate>Fri, 05 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.05846v1 Announce Type: cross Abstract: Automatic Speech Recognition (ASR) has become a key technology for human-AI interaction. However, code-switching ASR (CS-ASR) remains particularly challenging due to the severe scarcity of multilingual CS speech resources across diverse language pairs. Existing approaches primarily improve CS-ASR performance through synthetic CS speech generation or pair-specific fine-tuning on limited bilingual datasets. Nevertheless, these approaches face an inherent scalability limitation, as support for CS must be developed separately for language pairs whose number grows combinatorially with the number of supported languages. In this work, we investigate whether CS capabilities learned from a limited set of seen language pairs can generalize to unseen language pairs through model merging and domain generalization methods. Our experiments show that merged bilingual CS-ASR models modestly generalize to unseen language pairs, suggesting limited transfer of bilingual CS capabilities across language pairs.</description>
  <dc:source>Systems/eess.AS_(Audio_and_Speech_Processing)</dc:source>
</item>
<item>
  <title>To Be Multimodal or Not to Be: Query-Adaptive Audio-Visual Person Retrieval via Active Modality Detection</title>
  <link>https://arxiv.org/abs/2606.05931</link>
  <pubDate>Fri, 05 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.05931v1 Announce Type: cross Abstract: When retrieving a person from a video archive by voice and face, should the system be multimodal or not? In real-world broadcast archives, unlike curated benchmarks, a target may be heard but unseen, seen but unheard, or both. Fusing scores from an absent modality injects noise, degrading precision below the best unimodal system. We propose a query-adaptive framework that detects active modalities via cross-modal score consistency: when both modalities are active, files retrieved by one also score highly on the other; this agreement breaks down when a modality is absent. Classifiers driven by these cross-modal features achieve 89% detection accuracy. On the BBC Rewind corpus (with over 12,000 broadcast videos) the adaptive system attains 94.2% P@1, outperforming speaker-only (82.9%), face-only (93.4%), and fixed fusion (90.0%), recovering 64% of the gap to an oracle with ground-truth modality labels (96.6%).</description>
  <dc:source>Systems/eess.AS_(Audio_and_Speech_Processing)</dc:source>
</item>
<item>
  <title>The Silent Thought: Modeling Internal Cognition in Full-Duplex Spoken Dialogue Models via Latent Reasoning</title>
  <link>https://arxiv.org/abs/2603.17837</link>
  <pubDate>Fri, 05 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2603.17837v5 Announce Type: replace Abstract: During conversational interactions, humans subconsciously engage in concurrent thinking while listening to a speaker. Although this internal cognitive processing may not always manifest as explicit linguistic structures, it is instrumental in formulating high-quality responses. Inspired by this cognitive phenomenon, we propose a novel Full-duplex LAtent and Internal Reasoning method named FLAIR that conducts latent thinking simultaneously with speech perception. Unlike conventional &quot;thinking&quot; mechanisms in NLP, which require post-hoc generation, our approach aligns seamlessly with spoken dialogue systems: during the user&#39;s speaking phase, it recursively feeds the latent embedding output from the previous step into the next step, enabling continuous reasoning that strictly adheres to causality without introducing additional latency. To enable this latent reasoning, we design an Evidence Lower Bound-based objective that supports efficient supervised finetuning via teacher forcing, circumventing the need for explicit reasoning annotations. Experiments demonstrate the effectiveness of this think-while-listening design, which achieves competitive results on a range of speech benchmarks. Furthermore, FLAIR robustly handles conversational dynamics and attains competitive performance on full-duplex interaction metrics.</description>
  <dc:source>Systems/eess.AS_(Audio_and_Speech_Processing)</dc:source>
</item>
<item>
  <title>LLM-Enhanced Dialogue Management for Full-Duplex Spoken Dialogue Systems</title>
  <link>https://arxiv.org/abs/2502.14145</link>
  <pubDate>Fri, 05 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2502.14145v3 Announce Type: replace-cross Abstract: Achieving full-duplex communication in spoken dialogue systems (SDS) requires real-time coordination between listening, speaking, and thinking. This paper proposes a semantic voice activity detection (VAD) module as a dialogue manager (DM) to efficiently manage turn-taking in full-duplex SDS. Implemented as a lightweight (0.5B) LLM fine-tuned on full-duplex conversation data, the semantic VAD predicts four control tokens to regulate turn-switching and turn-keeping, distinguishing between intentional and unintentional barge-ins while detecting query completion for handling user pauses and hesitations. By processing input speech in short intervals, the semantic VAD enables real-time decision-making, while the core dialogue engine (CDE) is only activated for response generation, reducing computational overhead. This design allows independent DM optimization without retraining the CDE, balancing interaction accuracy and inference efficiency for scalable, next-generation full-duplex SDS.</description>
  <dc:source>Systems/eess.AS_(Audio_and_Speech_Processing)</dc:source>
</item>
<item>
  <title>Absorbing Discrete Diffusion for Speech Enhancement</title>
  <link>https://arxiv.org/abs/2602.22417</link>
  <pubDate>Fri, 05 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2602.22417v2 Announce Type: replace-cross Abstract: Inspired by recent developments in neural speech coding and diffusion-based language modeling, we tackle speech enhancement by modeling the conditional distribution of clean speech codes given noisy speech codes using absorbing discrete diffusion. The proposed approach, which we call ADDSE, leverages both the expressive latent space of neural audio codecs and the non-autoregressive sampling procedure of diffusion models. To efficiently model the hierarchical structure of residual vector quantization codes, we propose RQDiT, which combines techniques from RQ-Transformer and diffusion Transformers for non-autoregressive modeling. Results show competitive performance in terms of non-intrusive objective metrics on two datasets, especially at low signal-to-noise ratios and with few sampling steps. Code and audio examples are available online.</description>
  <dc:source>Systems/eess.AS_(Audio_and_Speech_Processing)</dc:source>
</item>
<item>
  <title>Anti-Hyperspectral Anomaly Detection: A First Study on Stealthy Lipschitz-Forcing Perturbations Against Unknown Detectors</title>
  <link>https://arxiv.org/abs/2606.05369</link>
  <pubDate>Fri, 05 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.05369v1 Announce Type: new Abstract: Hyperspectral imagery represents the best contemporary technology to remotely detect anomalous objects. Nevertheless, hyperspectral anomaly detection (HAD) leaves ground facilities/situations completely exposed. We develop the first anti-HAD (AHAD) technique, rendering the key objects undetected without perfect coordinate/position state information (CSI) of the detectors (e.g., reconnaissance aircraft). Our AHAD algorithm is generally applicable to defend against almost all the existing benchmark data-driven and model-driven HAD methods. AHAD is fundamentally different from conventional adversarial attacks, so novel theory is needed. We customize novel regularizers for assimilating real anomalies into the backgrounds (ARAB) and fooling the detectors with pseudo-anomalies, thereby optimizing an energy-efficient stealthy perturbation signal for AHAD. The ARAB regularization is mathematically interpretable as flattening the topology-enhanced anomaly/background structures in the feature space, hence termed Lipschitz-forcing perturbations. Considering the imperfect CSI, we further develop a robust AHAD criterion, where the uncertainty is mathematically described as matrix-shifting misalignment for statistically generating the robust perturbation. Comprehensive experiments demonstrate the effectiveness and robustness of our AHAD algorithm across diverse real-world datasets. Remarkably, our algorithm generates a single AHAD perturbation signal that can simultaneously evade almost all benchmark detectors, greatly enhancing its practicality, given that the reconnaissance detector type is usually unknown. To the best of our knowledge, this is the first formal AHAD study. As a side contribution, we propose a new quantitative performance index, ArmCBA, to evaluate the robustness of an HAD method against our AHAD signal.</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>Hadamard-Based Recursive Aperture Decoded Ultrasound Imaging (READI) With Estimated Motion-Compensated Compounding (EMC2) Using Top-Orthogonal to Bottom Electrode (TOBE) Arrays</title>
  <link>https://arxiv.org/abs/2509.08781</link>
  <pubDate>Fri, 05 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2509.08781v4 Announce Type: replace Abstract: Hadamard matrix-based aperture encoding is a method for producing synthetic aperture datasets with high Signal-to-Noise Ratios. Recently, the pulse inversion capabilities of bias-sensitive Top-Orthogonal to Bottom Electrode (TOBE) arrays have driven the development of multiple Hadamard-based sequences. These sequences produce high-quality static images but are sensitive to motion. This work introduces Recursive Aperture Decoded Imaging (READI) and Estimated Motion-Compensated Compounding (EMC2), which look to reduce this sensitivity. READI is a novel decoding and beamforming technique for Hadamard aperture-encoded sequences that produces multiple low-resolution images from subsets of the full sequence. These READI images are less affected by motion and sum to form the complete high-resolution image. EMC2 describes the process of comparing these low-resolution images to estimate the underlying motion, then warping them to align before compounding. This produces a high-resolution image that is resilient to motion. READI with EMC2 is applied to the TOBE-based Fast Orthogonal Row-Column Electronic Scanning (FORCES) sequence. It is shown to fully restore images corrupted by probe motion and to recover tissue speckle and boundaries in images of a beating heart phantom. READI low-resolution images by themselves are demonstrated to be a marked improvement over a sparse Hadamard scheme with the same transmit count, and are able to recover blood speckle at a flow rate of 42 cm/s.</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>EquivAnIA: A Spectral Method for Rotation-Equivariant Anisotropic Image Analysis</title>
  <link>https://arxiv.org/abs/2603.11294</link>
  <pubDate>Fri, 05 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2603.11294v2 Announce Type: replace Abstract: Anisotropic image analysis is ubiquitous in medical and scientific imaging, and while the literature on the subject is extensive, the robustness to numerical rotations of numerous methods remains to be studied. Indeed, the principal directions and angular profile of a rotated image are often expected to rotate accordingly. In this work, we propose a new spectral method for the anisotropic analysis of images (EquivAnIA) using two established directional filters, namely cake wavelets and ridge filters. We show that it is robust to numerical rotations through extensive experiments on synthetic and real-world images containing geometric structures or textures, and we also apply it successfully to a task of angular image registration. The code is available at https://github.com/jscanvic/Anisotropic-Analysis</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>Data Detection for Massive MIMO Systems with 1-Bit Quantized Dithered Linear Precoding</title>
  <link>https://arxiv.org/abs/2606.05447</link>
  <pubDate>Fri, 05 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.05447v1 Announce Type: new Abstract: The power consumption of the analog-to-digital converters (ADCs) and digital-to-analog converters (DACs) in fully digital massive multiple-input multiple-output (MIMO) systems motivates the adoption of low-resolution architectures. In particular, 1-bit DACs reduce the power consumption and hardware complexity at the transmitter, but introduce severe transmit-side quantization distortion. In this paper, we investigate data detection for a point-to-point massive MIMO system with 1-bit DACs at the transmitter, where the linearly precoded signal is dithered prior to quantization, and either full-resolution or 1-bit ADCs at the receiver. Assuming that the dither vector applied at the transmitter is known at the receiver, we first develop soft-estimation-based data detection methods with symbol-independent dither removal for both full-resolution and 1-bit ADCs. We then introduce a new symbol-dependent linearization of the transmitted signal at the output of the 1-bit DACs and use it to derive maximum-likelihood (ML)-based data detection methods that directly recover the data symbol vector from the received signal. For full-resolution ADCs, this leads to an ML-based method with and without dither removal. For 1-bit ADCs, we develop an approximate ML-based method that exploits the derived statistics of the received signal without dither removal. We also propose low-complexity variants of the ML-based methods to mitigate the exponential complexity growth with the number of streams. Numerical results in terms of symbol error rate highlight the critical role of the dither power and demonstrate that the proposed ML-based methods (along with their low-complexity variants) achieve significant gains over a baseline based on binary ML detection via a homotopy algorithm.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>3D Spherical Fluid Antennas for Spatially Reconfigurable Communications</title>
  <link>https://arxiv.org/abs/2606.05589</link>
  <pubDate>Fri, 05 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.05589v1 Announce Type: new Abstract: As sixth-generation (6G) wireless systems evolve toward higher frequency bands, large-scale antenna arrays, and intelligent interaction with the wireless environment, conventional fixed-position antennas (FPAs) are increasingly constrained by limited spatial degrees of freedom and insufficient hardware-level adaptability. Fluid antenna systems (FAS) provide new physical-layer flexibility by dynamically reconfiguring antenna ports, geometries, and radiation characteristics. However, existing studies have mainly focused on one- or two-dimensional apertures, leaving the spatial reconfigurability required for complex three-dimensional (3D) propagation environments insufficiently exploited. In this article, we present a 3D spherical fluid antenna system (3D SFAS) architecture for flexible spatially reconfigurable communications. By activating radiating elements in different spherical regions, 3D SFAS realizes array-level spatial reconfiguration through flexible region switching. Within the selected regions, element-level reconfiguration further adjusts the effective aperture size, array topology, and radiation characteristics. This joint framework enables flexible beamforming, concurrent multi-region transmission, blockage-adaptive aperture switching, effective-aperture reconfiguration, and high-resolution 3D aperture control. We also discuss its potential applications in space-air-ground integrated networks, high-mobility communications, integrated sensing and communication systems, and emergency communications. Numerical results demonstrate the potential of 3D SFAS to improve wireless communication performance through flexible spatial reconfiguration. Overall, 3D SFAS extends FAS design beyond 2D position switching toward comprehensive 3D spatial reconfigurability.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Subarray based Wideband Beamforming and Variational Sparse CSI Estimation for Low-Resolution MU THz MIMO Systems</title>
  <link>https://arxiv.org/abs/2606.06110</link>
  <pubDate>Fri, 05 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.06110v1 Announce Type: new Abstract: This work conceives a unified channel estimation and beamforming framework, formulated within the principles of variational Bayesian inference. Recognizing the limitations imposed by hardware constraints, frequency-dependent propagation effects, and the structural restrictions of partially connected architectures in the Terahertz (THz) band, we formulate a dual-wideband channel model incorporating a root raised cosine (RRC) pulse shape to account for its band-limited nature. To further address the nonlinear distortions introduced by low-resolution ADCs, Bussgang decomposition is employed, enabling a tractable linearized inference process. Unlike conventional techniques, the proposed method accommodates both on-grid and off-grid angular domains, capturing spatial sparsity with improved resolution and robustness. The multi-user (MU) Bayesian Cramér-Rao lower bound is also derived to benchmark the performance of the proposed estimator. Moreover, the framework incorporates a true time delay (TTD)-based hybrid transceiver design that inherently compensates for the beam-squint effect, a frequency-dependent angular deviation that arises due to the fixed-phase nature of the conventional beamformer in wideband systems, thereby ensuring accurate directional alignment across all subcarriers. Extensive simulation results validate the effectiveness of the proposed variational Bayesian inference-based estimator and the TTD-enabled beamforming architecture, highlighting their robustness and performance gains in practical wideband THz systems.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Foundation Models for Wireless Communications: From PHY Intelligence to Network Autonomy</title>
  <link>https://arxiv.org/abs/2606.06239</link>
  <pubDate>Fri, 05 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.06239v1 Announce Type: new Abstract: 6G networks will introduce unprecedented complexity, which calls for a paradigm shift in network optimization and management. Artificial intelligence (AI)-based solutions, especially those enabled by the recently developed foundation models, have been recognized as promising candidates. Foundation models are large-scale AI models with general-purpose feature extraction capabilities, and once trained on massive amounts of data, they can be adapted to solve a wide range of downstream tasks, either in a zero-shot manner or with few-shot fine-tuning. This article provides a comprehensive overview of how foundation models are reshaping physical-layer processing and wireless resource management across three progressive paradigms. First, we examine the adaptation of off-the-shelf pre-trained foundation models to various wireless tasks. Second, we explore wireless-native foundation models, built from scratch on wireless data to bridge cross-domain modality gaps and capture universal wireless-domain physical characteristics. Third, we highlight agentic foundation models, which elevate static data processing into autonomous, reasoning-driven network orchestration. Furthermore, we discuss the impact of applying foundation models to emerging 6G frontiers, including integrated sensing and communications (ISAC), new multiple-input multiple-output (MIMO) architectures, semantic communications, and system-level network autonomy. Finally, we identify critical open challenges and opportunities, charting a promising path toward fully intelligent and adaptive wireless networks.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>LatentWave: JEPA Pretraining for Wireless Foundation Models</title>
  <link>https://arxiv.org/abs/2606.06373</link>
  <pubDate>Fri, 05 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.06373v1 Announce Type: new Abstract: Wireless foundation models have emerged as a promising alternative to building separate models for each wireless task. However, existing approaches rely on masked input reconstruction, which can bias representations toward low-level signal details. In this paper, we propose LatentWave, a wireless foundation model pretrained using a Joint-Embedding Predictive Architecture (JEPA) on diverse wireless spectrograms and channel state information (CSI). By predicting masked regions in latent space, LatentWave learns representations that are more transferable out of the box across diverse downstream tasks. The proposed architecture employs per-channel patch embeddings with stochastic channel sampling during pretraining, allowing it to process variable antenna counts and improving usability across heterogeneous wireless configurations. We evaluate LatentWave on four downstream tasks: RF signal classification, 5G NR positioning, beam prediction, and LoS/NLoS classification, comparing against a masked-modeling baseline (WavesFM) pretrained on the same data. Additionally, we show that the masking geometry introduces a task-dependent inductive bias: frequency masking strongly favors channel-related tasks such as positioning and beam prediction, while region masking better preserves discriminability for signal classification.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>PyCC.id: A package for hypothesis-driven equation discovery with structural identifiability</title>
  <link>https://arxiv.org/abs/2606.05191</link>
  <pubDate>Fri, 05 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.05191v1 Announce Type: cross Abstract: Data-driven equation discovery is fundamentally an inverse problem that seeks to infer the governing differential equations of a system directly from time-series measurements. A known issue is the ill-conditioned nature of the inverse problem, which frequently produces multiple mathematical models that fit the data similarly well. One path to address this issue is by incorporating known hypotheses and constraints into the training phase beforehand. While this approach effectively reduces the search space, it still results in multiple candidate models, forcing practitioners to rely on post-hoc manual filtering based on their own domain expertise. A recent approach incorporates structural &#39;skeletons&#39; inspired by characteristic curves (CCs), defining a hypothesis-driven methodology. In this methodology, practitioners define a skeleton, which is associated with a family of ordinary differential equations (ODEs), and then add their hypotheses and priors based on their domain knowledge to refine the obtained model iteratively. An important advantage of this approach is that some skeletons have demonstrable structural identifiability properties, which are useful for checking whether the skeleton is correct or should be discarded. Furthermore, this formalism enables the use of multiple equation discovery paradigms due to its modularity (such as neural networks, symbolic regression, and sparse regression). In this work, we present the Python library PyCC, which condenses these efforts into a flexible tool that allows researchers and engineers to seamlessly define their skeletons and hypotheses to discover ODEs from time-dependent data.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Central Description Length (CDL) Clustering Validation Index</title>
  <link>https://arxiv.org/abs/2606.05230</link>
  <pubDate>Fri, 05 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.05230v1 Announce Type: cross Abstract: Selecting a clustering algorithm and its hyperparameters without labels is a common difficulty in engineering machine learning pipelines that work with unsupervised analysis of sensor, image, or process data. Clustering validation indices (CVIs) provide internal scores for ranking candidate clusterings, but most popular CVIs are built from Euclidean compactness and separation terms and so tend to favour compact, convex partitions. Their performance is known to degrade on non-convex, irregular, or variable-density data, where kernel transformations or alternative distance measures are typically used at the cost of additional tuning and computation. This paper introduces the Central Description Length (CDL) clustering validation index. CDL uses the observed within-cluster compactness, the estimated cluster centers, and the estimated cluster covariances to compute a probabilistic upper bound on the description length associated with the unobservable true cluster centers. The bound condenses intra-cluster compactness and centroid displacement into a single computable quantity and is evaluated on the partition produced by any clustering algorithm. The implementation uses only observable quantities (the data, the partition, the estimated centers, and the estimated covariances) and does not use ground-truth labels. On synthetic benchmarks with non-convex and arbitrary-shape clusters, CDL-CVI selected the reference number of clusters more often and reached higher Adjusted Rand Index (ARI) values than the conventional CVIs we tested, without an additional kernel preprocessing stage. On image benchmarks (MNIST, CIFAR-10, STL-10) clustered from frozen unsupervised embeddings, CDL-CVI returned cluster numbers close to the reference class counts across K-means, DBSCAN, and spectral clustering in the reported trials.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Bounded Deep Unfolding for Joint Beamforming and Scheduling in Multi-Cell MIMO Networks</title>
  <link>https://arxiv.org/abs/2606.05246</link>
  <pubDate>Fri, 05 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.05246v1 Announce Type: cross Abstract: This paper investigates the joint resource block group (RBG) scheduling and beamforming optimization problem for weighted sum-rate (WSR) maximization in multi-cell multiple-input multiple-output (MIMO) downlink networks. While the Fast Fractional Programming (FastFP) framework provides a reliable model-driven solution, it suffers from conservative continuous beamforming updates and prohibitive computational overhead during the discrete RBG matching phase. To address these bottlenecks, we propose a joint deep unfolding framework comprising two core modules: P-Net and K-Net. For continuous beamforming, P-Net learns an adaptive relaxation factor along the analytical FastFP update direction. By strictly constraining this factor within an ascent-preserving interval, P-Net accelerates the optimization trajectory while rigorously retaining monotonic improvement and stationary-point convergence guarantees. For discrete RBG scheduling, K-Net learns a long-horizon priority policy that guides a low-complexity greedy assignment, effectively preserving the assignment quality while bypassing the high complexity of Hungarian matching. Both networks leverage analytical algorithmic priors and utilize recurrent parameter sharing, enabling flexible inference beyond the training horizon. Extensive simulations demonstrate that the proposed joint framework achieves higher WSR and faster execution times than conventional model-driven baselines, while generalizing robustly across unseen network scales, antenna configurations, and channel conditions without retraining.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Towards Unified and Data-Efficient Prognostics and Health Management with Tabular Foundation Models</title>
  <link>https://arxiv.org/abs/2606.05481</link>
  <pubDate>Fri, 05 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.05481v1 Announce Type: cross Abstract: Data-driven Prognostics and Health Management (PHM) uses time-varying condition-monitoring data to diagnose system states and estimate remaining useful life in engineered assets. These tasks are central to maintenance planning, but industrial PHM data are often fragmented, partially observed, and poorly labeled, which hinders supervised learning. Foundation models offer a route toward reusable predictive systems, yet most time-series foundation models are designed for forecasting and assume long, coherent, regularly sampled sequences. To address this gap, we propose a framework for applying Tabular Foundation Models to industrial time series using in-context learning, and we evaluate them on a variety of PHM tasks. By converting raw unit-level signals into tabular rows, we show that these models perform well across multiple tasks, including prognostics and diagnostics, and are highly data efficient. We compare them directly with sequence models, transformer baselines, and gradient-boosted trees under a common evaluation protocol. The results indicate that tabular foundation models achieve the best average ranks across prognostic and diagnostic tasks. Our findings further show that PFN-based models are competitive in low-data regimes, that temporal context can be preserved in the tabular representation, and that performance depends on representative context construction under subsampling. These results demonstrate that tabular foundation models provide a practical and general interface for heterogeneous PHM problems.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>HoT-SSM: Higher-order Temporal Knowledge Graph Reasoning with State Space Models for Health Care</title>
  <link>https://arxiv.org/abs/2606.05994</link>
  <pubDate>Fri, 05 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.05994v1 Announce Type: cross Abstract: Medical knowledge graphs (MKGs) infused with clinical knowledge have been increasingly used to model electronic health records (EHRs) to support interpretable predictions in the healthcare domain. However, existing MKG-based approaches are limited to capturing pairwise relations between clinical concepts (e.g., conditions, procedures, and medications), which restricts their ability to model higher-order interactions among co-occurring or semantically related concepts. In addition, most representation learning methods that leverage MKGs either collapse temporal information across visits or lack an explicit mechanism for modeling long-range temporal dependencies, which is critical for clinical tasks such as mortality prediction. To mitigate these limitations, we propose HoT-SSM, a parameter-efficient, higher-order temporal graph reasoning framework with state space models. For each visit, HoT-SSM constructs hypergraphs by grouping semantically related clinical concepts into hyperedges using domain knowledge, thereby preserving visit-level clinical context. Further, to model the temporal dynamics while learning the representations, we introduce a novel dynamic hypergraph-based state space model that explicitly captures patients&#39; latent state evolution over time while preserving long-range information. The learned representations are used for downstream clinical prediction and reasoning. Experiments on the MIMIC-III and MIMIC-IV datasets show significant performance improvement over the current state-of-the-art models, demonstrating the effectiveness of jointly modeling higher-order clinical interactions and long-range temporal dependencies.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>From Ground to Sky: Architectures, Applications, and Challenges Shaping Low-Altitude Wireless Networks</title>
  <link>https://arxiv.org/abs/2506.12308</link>
  <pubDate>Fri, 05 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2506.12308v4 Announce Type: replace Abstract: In this article, we introduce a novel low-altitude wireless network (LAWN), which is a reconfigurable, three-dimensional (3D) layered architecture. In particular, the LAWN integrates connectivity, sensing, control, and computing across aerial and terrestrial nodes that enable seamless operation in complex, dynamic, and mission-critical environments. Different from the conventional aerial communication systems, LAWN&#39;s distinctive feature is its tight integration of functional planes in which multiple functionalities continually reshape themselves to operate safely and efficiently in the low-altitude sky. With the LAWN, we discuss several enabling technologies, such as integrated sensing and communication (ISAC), semantic communication, and fully-actuated control systems. Finally, we identify potential applications and key cross-layer challenges. This article offers a comprehensive roadmap for future research and development in the low-altitude airspace.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Predictive Control over Low-Altitude Wireless Networks: Joint Trajectory Design and Resource Allocation</title>
  <link>https://arxiv.org/abs/2507.02374</link>
  <pubDate>Fri, 05 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2507.02374v3 Announce Type: replace Abstract: Low-altitude wireless networks (LAWNs) have been envisioned as flexible and transformative platforms for enabling delay-sensitive control applications in Internet of Things (IoT) systems. In this work, we investigate real-time wireless control over LAWNs, where an aerial drone is employed to serve multiple mobile automated guided vehicles (AGVs) via finite blocklength (FBL) transmission. Toward this end, we adopt model predictive control (MPC) to ensure accurate trajectory tracking, while we analyze the communication reliability using the outage probability. Subsequently, we formulate an optimization problem to jointly determine control policy, transmit power allocation, and drone trajectory by accounting for the maximum travel distance and control input constraints. To address the resultant non-convex optimization problem, we first derive a closed-form expression for the outage probability under FBL transmission. Based on this, we reformulate the original problem as a quadratic programming (QP) problem, followed by developing an alternating optimization (AO) framework. Specifically, we employ the projected gradient descent (PGD) method and the successive convex approximation (SCA) technique to achieve computationally efficient sub-optimal solutions. Furthermore, we thoroughly analyze the convergence and computational complexity of the proposed algorithm. Extensive simulations and AirSim-based experiments are conducted to validate the superiority of our proposed approach compared to the baseline schemes in terms of control performance.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Soft-Coherent Direct Multipath SLAM</title>
  <link>https://arxiv.org/abs/2604.19723</link>
  <pubDate>Fri, 05 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2604.19723v3 Announce Type: replace Abstract: Challenging indoor and urban environments with severe multipath propagation and obstructed line-of-sight degrade classical radio positioning. Multipath-based simultaneous localization and mapping (MP-SLAM) addresses this by building and exploiting propagation maps for robust localization. Emerging distributed multiple-input multiple-output (D-MIMO)/extremely large-scale MIMO (XL-MIMO) infrastructures provide large spatial apertures and high-resolution sensing, especially when phase coherence is maintained across base stations, subarrays, or distributed arrays. We propose a scalable Bayesian direct MP-SLAM method for coherent data fusion in D-MIMO/XL-MIMO systems that jointly infers the environment while performing robust, high-accuracy localization directly from raw radio signals. While commonly used zero-mean Type-II likelihood functions inherently lead to noncoherent processing across distributed arrays and thus to aperture loss, the proposed phase-preserving nonzero-mean Type-II likelihood shares a complex mean across distributed arrays. This enables coherent fusion and preserves the distributed aperture gain, while the variance captures noncoherent signal power. The method is combined with a surface model that enables map-feature fusion across the distributed infrastructure and supports near-field propagation and visibility effects. Bayesian inference is performed using belief propagation by means of the sum-product algorithm on a factor graph with particle-based messages. Parallelizing over particles and arrays, the GPU-accelerated implementation achieves millisecond-level runtimes even in large or distributed infrastructures. Simulation results show that the proposed method achieves performance gains over existing noncoherent methods and approaches the corresponding posterior CRLB, highlighting the potential of coherent processing for high-resolution sensing and localization.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Algebraic Diversity: Principles of a Group-Theoretic Approach to Signal Processing</title>
  <link>https://arxiv.org/abs/2604.19983</link>
  <pubDate>Fri, 05 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2604.19983v5 Announce Type: replace Abstract: We present principles of algebraic diversity (AD), a group-theoretic approach to signal processing exploiting signal symmetry to extract more information per observation, complementing classical methods that use temporal and spatial diversity. The transformations under which a signal&#39;s statistics are invariant form a matched group; this group determines the natural transform for analysis, and averaging an estimator over the group action reduces variance without requiring additional snapshots. The viewpoint is broadened in five directions beyond the single-observation measurement of a companion paper. Rank promotion admits AD on scalar data streams and identifies the law of large numbers as the trivial-group case of a $(G, L)$ continuum combining sample-count with group-orbit averaging. An eigentensor hierarchy handles signals with nested symmetry. A blind group-matching methodology identifies the matched group from data via a polynomial-time generalized eigenvalue problem on the unitary Lie algebra, placing the DFT, DCT, and Karhunen-Loève transforms as distinguished points on a transform manifold. A cost-symmetry matching principle then extends AD from measurement to blind and adaptive signal processing generally; blind equalization is given as a detailed example, with the Constant Modulus Algorithm&#39;s residual phase ambiguity predicted analytically and matched within two degrees on 3GPP TDL multipath channels, and other blind problems in signal processing are mapped into the framework. Four theorems formalize a structural capacity $\kappa$, the Rényi-2 analog of Shannon and von Neumann&#39;s Rényi-1 entropies, quantifying how a signal&#39;s information is organized rather than how much information it contains. The relationship of AD to prior algebraic approaches, including invariant estimation, minimax robust estimation, algebraic signal processing, and compressed sensing, is also discussed.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Clinical Utility and Feasibility of Smartphone-based EEG in Kenya: A Multicenter Observational Study</title>
  <link>https://arxiv.org/abs/2605.08157</link>
  <pubDate>Fri, 05 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2605.08157v3 Announce Type: replace Abstract: Purpose: Access to electroencephalography (EEG) remains limited across low- and middle-income countries (LMICs) due to cost, infrastructure requirements, and a shortage of trained staff. This study evaluated the feasibility and clinical utility of a smartphone-based EEG system in a real-world setting. Methods: We conducted a multicenter observational study (November 2023 to April 2026) across 29 clinical sites in Kenya. A smartphone-based 27-lead EEG system enabled trained healthcare workers to acquire standardized recordings with remote expert interpretation. Results: 3,036 EEG sessions were performed. Male patients constituted 57.8% of the cohort, with representation across pediatric and adult populations. The most common referral indication was seizures or convulsions (68.5%). Overall, 2,915 (96%) recordings were interpretable, while 121 (4%) were uninterpretable, primarily due to high electrode impedance and insufficient recording duration. Uninterpretable recordings were significantly shorter than interpretable recordings (mean 18.5 vs. 33.8 minutes; median 15.1 vs. 31.6 minutes; p &lt; 0.0001). Mean turnaround time for interpretation was 107 minutes. Among interpretable recordings, 917 (30.2%) were abnormal, including 701 (76.4%) with epileptiform abnormalities, 215 (23.4%) with non-epileptiform findings, and 1 (0.1%) indeterminate finding. Epileptiform abnormalities were highest in children aged 4-9 years (33.1%) and less frequent in adults (14-21%). Non-epileptiform abnormalities were more common in patients aged 60+ years (19.2%) compared to younger age groups (3-9%). Conclusion: Large-scale, point-of-care EEG acquisition by non-specialist operators in a resource-limited setting is feasible. Expansion of smartphone-based EEG systems may improve equitable access to neurological diagnosis and care in LMICs.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Communication Security and Sensing Privacy in FMCW-Based ISAC Through Signal Modulation</title>
  <link>https://arxiv.org/abs/2605.23429</link>
  <pubDate>Fri, 05 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2605.23429v3 Announce Type: replace Abstract: This study proposes a novel radar-centric signaling design and architecture for secure integrated sensing and communication (ISAC) systems. The proposed framework is designed to provide robust physical layer security for data transmission while simultaneously enhancing sensing privacy. It employs index modulation and phase coding over frequency-modulated continuous-wave radar (FMCW) chirps, where index modulation (IM) provides an outer layer of data security, and we explicitly design the phase coding (PC) to perturb the resulting signal&#39;s ambiguity function (AF) to enhance sensing privacy. This design reduces the risk of unauthorized surveillance by rendering target velocity estimation practically infeasible for unauthorized passive sensing hardware (i.e., a sensing eavesdropper, S-Eve) and significantly impairing its range estimation capabilities. Furthermore, this study also presents the transmitter and receiver architectures required for effective modulation and demodulation of the proposed ISAC signaling and for performing sensing at the legitimate sensing hardware. Simulation results show that the proposed approach achieves high data throughput while enhancing communication security and sensing privacy.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>FMCW-Based Integrated Sensing and Communication System: Design, Implementation, and Experimental Measurements</title>
  <link>https://arxiv.org/abs/2605.23564</link>
  <pubDate>Fri, 05 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2605.23564v2 Announce Type: replace Abstract: This study proposes a radar-centric integrated sensing and communication (ISAC) system utilizing a two-layer modulation scheme for vehicular networks. Frequency-modulated continuous wave (FMCW) chirps are jointly modulated via phase modulation (PM) and index modulation (IM) to transmit data while maintaining sensing as the primary function. To support this, a novel radar signal processing technique is developed to mitigate the impacts of IM and PM on sensing accuracy, alongside a communication receiver architecture designed to successfully demodulate IM and PM data within FMCW chirps. System performance is evaluated through simulations in the 2.4 GHz and 24 GHz bands under Doppler effects, achieving communication throughputs of 25 Mbps and 50 Mbps, respectively. Furthermore, a proof-of-concept hardware implementation is realized, and experimental measurements via a loopback cable are performed to verify the feasibility of the architecture. Finally, the study evaluates the fundamental trade-off among communication throughput, sensing accuracy, and out-of-band emission, demonstrating the system&#39;s flexibility to dynamically adjust waveform parameters to meet varying operational requirements.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>TGSD: Topology-Guided State-Space Diffusion Framework for EEG Spatial Super-Resolution</title>
  <link>https://arxiv.org/abs/2606.03998</link>
  <pubDate>Fri, 05 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.03998v2 Announce Type: replace Abstract: Low-density EEG is more suitable for wearable and IoT-based brain sensing, but sparse electrode sampling often lacks sufficient spatial information to characterize cross-regional neural activity. EEG spatial super-resolution aims to recover dense-channel EEG from sparse recordings, yet remains challenging because channel missingness typically occurs at the whole-channel level, spatiotemporal dependencies over the full electrode layout are often underexplored, and the mapping from sparse to dense signals is inherently ambiguous. To address these issues, we propose TGSD, a topology-guided state-space diffusion framework for EEG spatial super-resolution. TGSD first employs a Hierarchical Spatial Prior Encoder to learn topology-aware priors over the complete electrode layout by integrating local geometric relationships with region-level contextual information. Based on these priors and sparse observations, a Conditional State-Space Diffusion Reconstructor progressively generates missing-channel signals through reverse diffusion, while alternating temporal and channel-wise state-space modeling captures long-range temporal dynamics and inter-channel dependencies in a unified framework. Experiments on the SEED and PhysioNet MM/I datasets show that TGSD consistently outperforms representative baselines under different super-resolution factors in both reconstruction fidelity and downstream classification performance. These results demonstrate the effectiveness of combining topology-aware spatial priors with conditional diffusion for enhancing practical low-density EEG sensing in wearable and IoT scenarios. The official implementation code is available at https://github.com/jtggz/TGSD.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Integrated Real-Time Testbed for Wideband RFID and Wireless Power Transfer</title>
  <link>https://arxiv.org/abs/2606.04207</link>
  <pubDate>Fri, 05 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.04207v2 Announce Type: replace Abstract: This contribution presents an experimental integrated real-time 8 x 8 distributed MIMO (D-MIMO) testbed for wideband backscatter communication (BSC) and wireless power transfer (WPT). The testbed operates in the 2.45 GHz band with coherent sampling at 200 MS/s, employs a backscatter link frequency of 40 kHz, and uses wideband 5G NR reference signals for excitation. We evaluate the testbed by exploiting the estimated channel state information (CSI) in two target applications: wireless power transfer towards the backscatter device (BD) and real-time positioning of a BD in an indoor environment. In conjunction with the baseband processing chain introduced, the testbed requires less than 2 ms of total airtime to excite the system and acquire the signals for subsequent synchronization and CSI estimation on uplink BSC signals. With the CSI, we demonstrate effective energy harvesting gains of up to 12 dB.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>VaN3Twin: the Multi-Technology V2X Digital Twin with Ray-Tracing in the Loop</title>
  <link>https://arxiv.org/abs/2505.14184</link>
  <pubDate>Fri, 05 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2505.14184v3 Announce Type: replace-cross Abstract: This paper presents VaN3Twin, the first open-source, full-stack Network Digital Twin (NDT) framework for simulating the coexistence of multiple Vehicle-to-Everything (V2X) communication technologies with accurate physical-layer modeling via ray tracing. VaN3Twin extends the ms-van3t simulator by integrating Sionna Ray Tracer (RT) in the loop, enabling high-fidelity representation of wireless propagation, including diverse Line-of-Sight (LoS) conditions with focus on LoS blockage due to other vehicles&#39; meshes, Doppler effect, and site-dependent effects (e.g., scattering and diffraction). Unlike conventional simulation tools, the proposed framework supports realistic coexistence analysis across DSRC and C-V2X technologies operating over shared spectrum. A dedicated interference tracking module captures cross-technology interference at the time-frequency resource block level and enhances signal-to-interference-plus-noise ratio (SINR) estimation by eliminating artifacts such as the bimodal behavior induced by separate LoS/NLoS propagation models. Compared to field measurements, VaN3Twin reduces application-layer disagreement by 50% in rural and over 70% in urban environments with respect to current state-of-the-art simulation tools, demonstrating its value for scalable and accurate digital twin-based V2X coexistence simulation.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Double-Directional Wireless Channel Modeling Using Statistics-Aided Machine Learning</title>
  <link>https://arxiv.org/abs/2606.05993</link>
  <pubDate>Fri, 05 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.05993v1 Announce Type: cross Abstract: The double-directional (DD) wireless channel model is important for realistic system design since it provides complete propagation information. While stochastic and deterministic channel models are widely adopted, and existing machine learning (ML) solutions mostly aim to align future channel realizations, these solutions are often limited to short time spans that may not be statistically significant. Moreover, because the number of multi-path components (MPCs) varies with spatial and temporal variation of the receiver (RX) and/or interacting objects (IOs), typical ML solutions that require fixed, predefined input and output shapes fall short. To curb these limitations, we propose a statistics-aided ML solution that relies on selecting a fixed subset of MPCs. More specifically, we first select the top-$M$ MPCs, where $M\in\mathbb{Z}^+$ is much smaller than the total number of MPCs, and construct learnable graphs to train our proposed hybrid TimesNet-TimeFilter (TNTF) model. We then use a channel statistics-aided training method to generate future top-$M$ DD channel realizations such that the statistics calculated from these realizations closely match the actual statistics of the complete time-varying DD channel realizations. We validate the proposed solution using extensive simulations on both synthetic stochastic channel model (SCM)-based and deterministic ray-tracing-based datasets, and demonstrate its effectiveness relative to state-of-the-art baselines.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Impact of RTK Augmentation and INS Integration on GNSS Positioning Accuracy and Continuity: A Benchmarking Study on Inland Waterways</title>
  <link>https://arxiv.org/abs/2606.06358</link>
  <pubDate>Fri, 05 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.06358v1 Announce Type: cross Abstract: RTK augmentation and INS integration are widely used to improve GNSS positioning performance. However, on inland waterways, bridges and surrounding structures can degrade satellite visibility and correction availability, causing RTK augmentation loss and GNSS/INS fusion transients. Since these effects depend on the local environment and sensor configuration, nominal receiver specifications are insufficient, and deployment-specific characterization is required. This paper presents a benchmarking study of an AsteRx-i3 D Pro+ GNSS/INS receiver installed within the mobile Sensor Box developed at KU Leuven. The study combines a real-world bridge-passage case study, static benchmarking, and closed-loop path-following experiments. The static benchmarking evaluates four receiver configurations: standalone GNSS, standalone GNSS with INS integration, RTK-augmented GNSS, and RTK-augmented GNSS with INS integration. The closed-loop experiments use INS-integrated GNSS as the navigation input and compare path-following operational performance with and without RTK augmentation. Results show that correction loss during bridge passage causes reduced positioning accuracy, increased positioning uncertainty, and recovery-induced state jumps exceeding 1 m. Static benchmarking and closed-loop experiments confirm that RTK augmentation substantially improves positioning precision and uncertainty consistency, while INS integration supports short-term continuity during RTK unavailability but may introduce drift, bias, or transient uncertainty variations. By characterizing the deployment-specific receiver behavior with RTK augmentation and INS integration, this study motivates higher-level state estimation as a necessary next step toward spatially continuous and uncertainty-consistent positioning on inland waterways. The experimental data are released at: https://doi.org/10.5281/zenodo.20541733.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Neural Radiated-Noise Fields for Unmanned Underwater Vehicle Noise Spectrum Prediction in Three-Dimensional Scenes</title>
  <link>https://arxiv.org/abs/2606.04008</link>
  <pubDate>Thu, 04 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.04008v1 Announce Type: new Abstract: Radiated noise in unmanned underwater vehicles (UUVs) is an important indicator for characterizing acoustic signatures and evaluating platform performance. To address the strong dependence of traditional physics-based modeling and numerical simulation methods on target structural information and environmental boundary conditions, and their inability to achieve continuous spatial spectrum-response modeling in three-dimensional scenes, this paper proposes a neural radiated-noise field (NRNF). An NRNF represents the UUV radiated-noise spectrum as a continuous function of the three-dimensional UUV position, the three-dimensional hydrophone position, the UUV yaw angle, and the frequency, enabling query-based prediction at arbitrary spatial locations. The proposed method employs sinusoidal encoding for position and frequency, and introduces a learnable three-dimensional scene feature grid to explicitly represent environmental structure and propagation effects. A spectrum-prediction dataset is constructed from lake trials, and the proposed model is evaluated under three settings: horizontal extrapolation, depth extrapolation, and cross-run generalization. Results show that the NRNF achieves an average prediction error of 3.5 dB in the 50 to 5000 Hz band. Horizontal extrapolation is easiest, depth extrapolation is the most challenging, and cross-run generalization is of intermediate difficulty. Further ablation results demonstrate that the scene feature grid significantly improves the prediction stability and spatial generalization of the model.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Encounter Geometry Effects on Space-Based Laser Debris Remediation and Estimation</title>
  <link>https://arxiv.org/abs/2606.04942</link>
  <pubDate>Thu, 04 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.04942v1 Announce Type: new Abstract: The escalating accumulation of orbital debris poses a critical threat to future space operations. Space-based lasers leveraging laser ablation have emerged as a promising approach for mitigating debris proliferation and preserving the orbital environment. Current literature, however, treats space-based laser debris remediation as a deterministic problem, assuming that momentum transfer and the resulting debris perturbations are precisely known. In reality, laser-to-debris engagement outcomes are inherently stochastic due to partially known debris characteristics. Compounding this challenge, estimating critical laser-matter parameters in situ, such as the momentum coupling coefficient, requires ablation that consequently perturbs the debris trajectory. This establishes a coupled ablation-and-estimation problem in which the laser platform and target debris encounter geometry influences remediation effectiveness and estimation accuracy. To address this problem, we present a joint ablation-and-estimation methodology that provides insights into the driving factors that make different encounter geometries improve or degrade overall remediation and estimation performance. Results across multiple coplanar and out-of-plane encounter geometries demonstrate how periapsis-lowering capacity, linear system observability, and nonlinear estimation performance evolve as laser parameters and relative orbit geometry vary. By identifying the key drivers behind these metrics, this study highlights critical considerations for the safe and effective operation of space-based lasers under uncertainty.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Access Protocols for Segmented Waveguide-Enabled Pinching-Antenna Systems (SWANs)</title>
  <link>https://arxiv.org/abs/2606.04913</link>
  <pubDate>Thu, 04 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.04913v1 Announce Type: new Abstract: This paper proposes an access protocol framework for segmented waveguide-enabled pinching-antenna systems (SWANs), which exploits SWAN-induced reconfigurable channel diversity as a protocol-level resource for uplink random access. The framework consists of two stages, a channel-oracle stage and an access stage, designed under three SWAN operating modes: (i) one-segment selection (OS), (ii) segment aggregation (SA), and (iii) segment multiplexing (SM). Specifically, in the channel oracle stage, the OS mode is adopted to acquire sparse pilot observations and infer the channel responses across the SWAN configuration space. In this way, high-dimensional uplink channel acquisition is recast as a low-dimensional geometric localization problem, thereby reducing pilot overhead while preserving channel reconstruction accuracy. For the access stage, we construct two oracle-guided access codebooks under the SA and SM modes, respectively, which address the tradeoff between hardware complexity and multiuser access resolution. In particular, the SA-based scheme supports single radio frequency (RF) chain access through randomized segment-group activation, whereas the SM-based R-access scheme exploits multiple RF chains to construct deterministic access slots and enhance collision resolution. Finally, our numerical results demonstrate that (i) the proposed two-stage framework improves access performance under the same training overhead, (ii) anchor densification is more effective than aggressive segment aggregation for SA, and (iii) SM-based R-access achieves deterministic coverage and higher throughput in moderate- and high-load regimes, whereas SA-based access remains attractive for low-complexity implementations.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>WiSER: A Wireless Scene Encoder for Geometry-Grounded Multi-View Wireless Prediction</title>
  <link>https://arxiv.org/abs/2606.04770</link>
  <pubDate>Thu, 04 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.04770v1 Announce Type: new Abstract: Indoor wireless propagation is governed by the interaction among three-dimensional (3D) scene geometry, radiomaterial properties, and transmitter and receiver configuration, which jointly determine both aggregate coverage behavior and path-level multipath structure. However, most learning-based site-specific prediction methods are designed for a single wireless representation, such as radiomap estimation or channel impulse response (CIR) prediction, and therefore do not explicitly exploit the propagation structure shared across heterogeneous wireless views. This paper introduces WiSER, a Wireless Scene Encoder for joint radiomap and multipath CIR prediction. WiSER maps a sparse voxel representation of an indoor scene and a transmitter location into a transmitter-conditioned sparse 3D scene memory, which is queried by two structure-aware decoders: a ray-corridor decoder for dense receiver-plane path-gain prediction and a Detection Transformer (DETR)-style set decoder for variable cardinality delay and power tap prediction. To train and evaluate this setting, we construct a co-registered indoor scene and wireless dataset pipeline using ScanNet++ indoor scenes and Sionna Ray Tracing, producing aligned sparse voxel inputs, dense radiomap labels, and unordered multipath CIR tap sets under a common coordinate frame and propagation configuration. Experimental results show that WiSER outperforms scene-specific radiomap baselines and substantially improves matched delay and power prediction over reference CIR baselines. These results suggest that transmitter-conditioned sparse 3D scene representations can serve as reusable wireless scene encoders for heterogeneous propagation queries, providing a geometry-grounded step toward representation learning and foundation-model development for AI-native wireless systems.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Ultra-precise TDoA-based Localization of Frequency Hopping LPWAN Transmitters</title>
  <link>https://arxiv.org/abs/2606.04756</link>
  <pubDate>Thu, 04 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.04756v1 Announce Type: new Abstract: The Internet of Things (IoT) is a rapidly emerging market. It serves as a key enabler for a variety of applications like the digital twin or asset tracking in industrial scenarios. This often requires the provision of precise position information. However, systems like Global Navigation Satellite Systems (GNSS) are ruled out due to high energy costs and indoor applications. A variety of systems is discussed to close this gap. In order to contribute to the investigations of possible gold standards, this paper discusses localization based on Low Power Wide Area Networks (LPWAN). To this end, a concept is presented, based on Time Difference of Arrival (TDoA) measurements within the LPWAN standard ETSI TS 103 357. This paper addresses two major challenges. First, TDoA measurements require highly precise temporal synchronization of the receiving base stations. Within this work, this issue is solved by exploiting Signals of Opportunity (SoO) as a synchronization source, enabling sub-meter synchronization accuracy. A further issue arises from the Frequency Hopping (FH) waveform of the transmitting endpoints, resulting in a loss of phase information and thus usable localization bandwidth. A method is introduced to overcome this limitation. This paper presents the system concept and demonstrates its functionality through theoretical investigations and simulations. Finally, real-world measurements verify the functionality and show a 2D localization accuracy of below 10 m in Line of Sight (LOS) scenarios.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Adaptive $c_2$-Perturbed AFDM Waveform Design for Integrated Sensing and Communication</title>
  <link>https://arxiv.org/abs/2606.04698</link>
  <pubDate>Thu, 04 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.04698v1 Announce Type: new Abstract: Affine frequency division multiplexing (AFDM) is a promising waveform for integrated sensing and communication (ISAC) systems owing to its superior performance in time-frequency doubly dispersive channels. However, AFDM still faces a pair of challenges: high PAPR and random data symbols produce imperfect autocorrelation sidelobes. To address these challenges, this paper proposes a real-time data-driven framework that optimizes the pre-chirp parameter $c_2$ to enhance the AFDM-ISAC performance. Specifically, a side-information-free optimization problem is formulated to reduce PAPR and the weighted integrated sidelobe levels of both aperiodic and periodic autocorrelation functions, with complexity comparable to that of the conventional AFDM receiver. Furthermore, an efficient non-monotone line-search spectral projected-gradient algorithm is developed by exploiting closed-form gradients. Simulation results demonstrate that the proposed method achieves a superior sensing vs. communications trade-off and improves bit error rate performance in the presence of severe power amplifier nonlinearity.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Joint 3D Trajectory and Power Allocation for HAPs-UAV Bistatic ISARAC in Low-Altitude Networks</title>
  <link>https://arxiv.org/abs/2606.04600</link>
  <pubDate>Thu, 04 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.04600v1 Announce Type: new Abstract: This paper investigates joint three-dimensional (3D) trajectory planning and resource allocation for a high-altitude platform (HAPs)-unmanned aerial vehicle (UAV) bistatic integrated synthetic aperture radar (SAR) and communication (ISARAC) system in low-altitude networks. In the proposed architecture, the HAPs provides persistent wide-area connectivity by transmitting ISARAC waveforms for ground-user communications, while a low-altitude UAV exploits its proximity and mobility to passively collect ground-target echoes for high-resolution SAR imaging. We formulate a sum-rate maximization problem for ground users subject to stringent SAR imaging signal-to-noise ratio (SNR) and resolution requirements, a total energy budget for ISARAC transmission, and UAV dynamic constraints. The resulting problem is inherently nonconvex. To tackle it, an alternating optimization (AO) framework is developed, where the power-allocation subproblem with fixed UAV states admits a closed-form water-filling solution, while the UAV trajectory optimization with fixed transmit powers is handled via successive convex approximation (SCA) and difference-of-convex (DC) programming. Simulation results verify the effectiveness of the proposed approach and demonstrate its capability to jointly support persistent communication coverage and high-resolution sensing in low-altitude network scenarios.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Robust Set-Membership Diffusion Normalization Subband Adaptive Filtering Algorithms Over Distributed Networks</title>
  <link>https://arxiv.org/abs/2606.04553</link>
  <pubDate>Thu, 04 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.04553v1 Announce Type: new Abstract: With the development of wireless sensor networks, distributed networks have received widespread attention. Depending on how their nodes are connected, distributed networks can be divided into different structures, of which the diffusion structure is the most commonly used owing to its simplicity, stability, and reliability. In order to improve the robustness of the diffusion subband algorithm in distributed networks, the median absolute deviation (MAD) theorem is applied to the error boundary selection, and this paper proposes a diffusion subband algorithm with a robust boundary. Simulations verify that the proposed algorithm effectively reduces the update step size in the presence of outlier interference, so that it achieves good convergence performance and good robustness to impulsive noise.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Microwave Linear Analog Computers Aided Multiuser Communication: General Impedance Matching and Precoding Optimization</title>
  <link>https://arxiv.org/abs/2606.04532</link>
  <pubDate>Thu, 04 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.04532v1 Announce Type: new Abstract: Microwave linear analog computers (MiLACs) have recently emerged as a hardware-efficient solution for implementing multi-antenna communication systems. Unlike existing MiLAC designs based on the ideal assumption of perfect impedance matching (PIM) with reflection-free transmission, this paper investigates MiLAC-aided precoding optimization under a general impedance matching (GIM) model, which enables more flexible precoder design at the cost of a potential reduction in radiated power. Specifically, we consider a downlink multiuser multiple-input single-output (MISO) communication system and aim to maximize the system sum rate by optimizing the MiLAC-enabled transmit precoding subject to physical circuit constraints. The formulated problem is challenging to solve due to the intricate coupling between the precoding and impedance parameters. To address this challenge, we first develop a singular value decomposition (SVD)-based parametric search framework for small or medium size systems. This framework exploits the feasible precoder structure and explicitly captures the tradeoff between power radiation efficiency and precoder design flexibility. We then propose a unified algorithm for solving the optimization problem based on the projected weighted minimum mean-square error (WMMSE) principle for arbitrary size systems with GIM- or PIM-based MiLAC precoding. Simulation results demonstrate that the GIM-based MiLAC design consistently outperforms its PIM counterpart as a special realization, especially in interference-limited scenarios, by allowing a moderate reduction in radiated power in exchange for additional precoder design flexibility and more effective interference mitigation. It is also shown that GIM-based MiLAC design achieves performance close to that of the baseline fully digital precoding system.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Gaussian-Process Dynamics of Diagonal Expectation Propagation under Variance-Profile Gaussian Measurements</title>
  <link>https://arxiv.org/abs/2606.04531</link>
  <pubDate>Thu, 04 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.04531v1 Announce Type: new Abstract: State-evolution analyses of approximate-message-passing and expectation-propagation-type algorithms rely on an effective-channel principle: after a suitable Onsager, orthogonal, or extrinsic correction, the nonlinear module receives a fresh scalar Gaussian observation. This paper studies this principle for diagonal expectation propagation under variance-profile Gaussian sensing matrices. The model preserves Gaussian conditioning, but removes the isotropy that supports the usual scalar decoupling arguments. We prove a finite-time large-system description in which the linear EP module remains Gaussian at the coordinate level, but is generally not a fresh scalar channel. Instead, the residuals form a coordinate-dependent Gaussian process whose covariance is shaped by the variance profile and by the finite linear history of the algorithm. The standard diagonal EP cavity cancels the instantaneous response of the incoming message, but may leave a component predictable from past residuals. We characterize this process through a conditioned matrix-Dyson-equation deterministic equivalent and a Schur-complement representation of the linear module. A Gaussian-regression decomposition then separates the predictable memory from the orthogonal innovation and yields an oracle state-evolution-level correction. Thus, under variance-profile measurements, the limiting object for diagonal EP is a Gaussian-process dynamics with profile-dependent memory rather than the conventional fresh-noise scalar state evolution.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>How Many Bits Are Required for RIS Designs Without Far-Field Quantization Lobe?</title>
  <link>https://arxiv.org/abs/2606.04156</link>
  <pubDate>Thu, 04 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.04156v1 Announce Type: new Abstract: Reconfigurable Intelligent Surface (RIS) designs with 1-bit phase resolution often suffer from strong quantization lobes in the far field, which significantly degrade wireless communication performance. This work investigates the minimum phase resolution required for RIS to eliminate far-field quantization lobe. The analysis demonstrates that 2-bit phase discretization offers an optimal balance between performance and hardware complexity. A practical 2-bit RIS unit cell is designed, and a 20 x 20 array configuration is implemented to evaluate its performance. The quantization-lobe suppression capability is validated through full-wave radar cross-section (RCS) simulations under plane-wave illumination for the entire RIS array. The fabricated prototype is further characterized experimentally, achieving a -13.1 dB quantization-lobe level compared to -0.8 dB for its 1-bit counterpart, confirming both the analytical and full-wave simulation results.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Spatial-Spectral Modeling of the Array Pattern of a Two-Element Dynamic Antenna Array with Differential Amplitude Modulation</title>
  <link>https://arxiv.org/abs/2606.04102</link>
  <pubDate>Thu, 04 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.04102v1 Announce Type: new Abstract: We present a theoretical model for a two-element dynamic phased array and characterize the transfer of information as a function of angle. The array is based on a two-state switched structure with phase shifting to support beamsteering. Dynamic motion of the phase center of antenna arrays generates time-varying radiation patterns that, when appropriately designed, support directional modulation, or the transfer of information to regions of space that are narrower than that covered by the energy radiated by the array. We evaluate the impact of switching frequency and steering on the spatial width of the information beam, which is the region of space where information is recoverable. The concepts are evaluated through simulation and experiment using a 0.75$\lambda$ two-element array operating at 2.5 GHz.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>SkySense: A Semi-Supervised Generative Framework for UAV Localization in ISAC Networks</title>
  <link>https://arxiv.org/abs/2606.04076</link>
  <pubDate>Thu, 04 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.04076v1 Announce Type: new Abstract: Extreme data scarcity and inherent multipath spatial ambiguity severely limit existing deep learning-based channel state information (CSI) fingerprinting localization schemes for target unmanned aerial vehicles (UAVs). To overcome these challenges, we propose an end-to-end semi-supervised generative localization framework. First, by exploiting the temporal correlations inherent in continuous flight trajectories, a self-supervised encoder extracts robust spatial features from massive unlabeled CSI sequences to establish structured latent representations. Following this, we utilize a consistency model, a powerful derivative of diffusion architectures, as the core generative backbone to map the learned latent space to physical coordinates, jointly fine-tuning the pre-trained encoder with a strictly limited set of labeled CSI. This consistency formulation models the conditional distribution to resolve the mean collapse problem of discriminative models, while compressing the inference trajectory to 1-2 steps to avoid the latency bottleneck of traditional diffusion models. Furthermore, a lightweight distributed fusion mechanism is designed to aggregate spatial predictions across multiple base stations (BS) from a multi-view geometry perspective. Comprehensive evaluations on a real-world measurement dataset demonstrate that our framework achieves low latency and suppresses the mean localization error to 9.77 cm under a 3-BS fusion setup with only a 1% label fraction, significantly outperforming existing fully supervised and semi-supervised discriminative baselines.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Gravity-Aware Hierarchical Routing for Lightweight SensorLLM on Human Activity Recognition</title>
  <link>https://arxiv.org/abs/2606.04019</link>
  <pubDate>Thu, 04 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.04019v1 Announce Type: new Abstract: Recent studies on sensor-language alignment have shown that two-stage frameworks can improve the semantic modeling ability of wearable-sensor human activity recognition (HAR), where SensorLLM-style methods first perform motion-to-language alignment and then fine-tune the model for downstream tasks. However, our experiments reveal a consistent failure mode when the Stage 2 backbone is compressed to a compact model such as TinyLlama: recognition of dynamic activities remains relatively strong, while the discrimination of low-motion static classes such as standing, sitting, and lying degrades substantially. To address this issue, we propose a gravity-aware hierarchical routing head as a lightweight post-alignment adaptation built on top of an already aligned model, rather than a new large-scale pretraining framework. The method uses the per-channel mean and std from the Chronos tokenizer state to extract statistical cues related to posture and gravity direction, and adaptively combines a static expert and a full expert through soft routing, together with a load-balancing loss for stable training. On the MHealth dataset, this design significantly improves macro-F1 with minimal parameter overhead, and the gains are concentrated mainly on static classes while preserving strong performance on dynamic activities. As a first arXiv disclosure, the current paper reports results on a single dataset only, with the goal of highlighting the core method and laying the groundwork for broader evaluation in future work.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>GenED-SC: Generative Editing Semantic Communication with Integrated Multi-Modal LLMs</title>
  <link>https://arxiv.org/abs/2606.04015</link>
  <pubDate>Thu, 04 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.04015v1 Announce Type: new Abstract: Deep learning-based joint source-channel coding has recently demonstrated strong potential for semantic communication (SemComm). However, most existing approaches focus on optimizing visual-fidelity metrics, which can lead to reduced perceptual quality. Generative model-based SemComm leverages rich prior knowledge from large-scale pre-training to enhance perceptual quality, but often at the cost of increased distortion and unreliability. This paper addresses the above issues by proposing a two-stage semantic image transmission framework, integrating a multimodal large language model (MLLM) for generative editing. In the first stage, a JSCC-based discriminative transmission selectively prioritizes semantically important regions, preserving scene layout and object integrity under limited bandwidth. In the second stage, MLLM-driven generative editing refines missing details based on the textual descriptions, enhancing semantic fidelity and perceptual quality. Extensive experiments show that the proposed framework achieves state-of-the-art performance in semantic preservation, perceptual quality, and visual fidelity across a wide range of channel conditions, especially in low-SNR regimes.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Distortion-Aware UAV Placement for Aerial Semantic Relay Communications: An Analytical Approach</title>
  <link>https://arxiv.org/abs/2606.04013</link>
  <pubDate>Thu, 04 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.04013v1 Announce Type: new Abstract: Aerial semantic relay communications (SRC) employs an unmanned aerial vehicle (UAV) equipped with a semantic encoder as a relay, which not only extends the data acquisition coverage of the base station (BS) from resource-limited sensing device (SD) but also enhances communication efficiency through semantic feature transmission over the UAV-BS link. Existing works mainly focus on sum-rate maximization, overlooking the end-to-end reconstruction distortion of sensory data in UAV-assisted SRC systems. Optimizing the UAV placement is crucial for minimizing the end-to-end reconstruction distortion, as it fundamentally trades off the input perturbation at the UAV-side encoder against that at the BS-side decoder through the two-hop wireless channel conditions. In this paper, we propose an interpretable and efficient UAV placement policy by minimizing end-to-end reconstruction distortion in aerial SRC. This is a challenging task since the black-box nature of the DNN-based codecs and the intricate coupling between the heterogeneous codec sensitivities, along with two-hop channel impairments, render the end-to-end distortion analytically intractable to characterize. We first derive an analytical expression of the end-to-end distortion, explicitly revealing the impact of cross-hop perturbation coupling, wireless channel and radio resource on the reconstruction error. Based on that, we develop a closed-form UAV placement strategy with fast adaptability across various aerial SRC system configurations. Numerical results demonstrate that the proposed distortion-aware UAV deployment closely tracks the empirical exhaustive-search optimum, while achieving lower distortion compared to representative capacity-based and curve-fitting benchmarks.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>A sharp analysis of Root-MUSIC: locations of correct and extraneous roots</title>
  <link>https://arxiv.org/abs/2606.04003</link>
  <pubDate>Thu, 04 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.04003v1 Announce Type: new Abstract: Root-MUSIC is a spectral estimation algorithm that approximates the unknown signal frequencies by constructing a high-degree polynomial and finding a subset of roots which are closest to the complex unit circle. Previous works found asymptotic expectation formulas for the performance of Root-MUSIC under the implicit assumption that the aforementioned root selection criterion does not select extraneous roots -- those which are unrelated to the correct parameters. This paper removes the need for this assumption by showing all extraneous roots lie outside an annulus of a certain thickness and therefore are not selected by the algorithm. This paper also provides sharp, non-asymptotic, and explicit error bounds for the correct roots in terms of fundamental model parameters. All results hold under a natural separation condition on the correct signal frequencies and are applicable in both the single- and multi-snapshot models. More specifically, in the multi-snapshot model, we prove that Root-MUSIC estimates the frequencies with error at most $O(\sigma /(m \sqrt n))$, where $\sigma^2$ is the noise variance, $m$ is the number of sensors, and $n$ is the number of snapshots. A novelty of this non-asymptotic bound is the explicit $1/m$ decay, which indicates that there is a significant advantage in utilizing additional sensors. Numerical simulations confirm our theory. The main mathematical insight of this paper is a geometric property of the Root-MUSIC polynomial: its correct roots are highly stable to noise while its extraneous roots must lie outside of an annulus.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Geometry-Structured Channel Reconstruction for Conventional and Fluid Antenna Systems: Bayesian Inference and Fundamental Limits</title>
  <link>https://arxiv.org/abs/2606.04001</link>
  <pubDate>Thu, 04 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.04001v1 Announce Type: new Abstract: Accurate channel state information (CSI) acquisition is critical for exploiting the spatial flexibility of fluid antenna systems (FASs). However, port selection and transmission optimization require CSI over a large number of candidate port positions, making direct port-wise estimation prohibitively costly in terms of pilot overhead. This paper addresses this challenge through geometry-structured channel reconstruction, which exploits the fact that the port-domain CSI can be parameterized by a small number of dominant propagation paths. We first establish fundamental mean square error (MSE) and normalized MSE (NMSE) benchmarks for both geometry-structured and unstructured channel reconstruction, providing analytical references for evaluating the intrinsic benefit of geometric modeling in conventional antenna systems and FASs. Motivated by the strong spatial correlation induced by densely distributed fluid antenna ports, we further propose a Bayesian reconstruction framework, termed geometry-structured expectation-maximization approximate message passing (GS-EM-AMP). The proposed algorithm incorporates geometric channel structure into the EM-AMP procedure and adaptively learns unknown statistical parameters from noisy observations. Numerical results demonstrate that GS-EM-AMP achieves near-bound reconstruction accuracy while maintaining strong robustness against steering-domain correlation, thereby offering an efficient and reliable solution for large-scale CSI acquisition in FASs.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Airy Beam Dispersion in Near-Field Wideband Terahertz Communications</title>
  <link>https://arxiv.org/abs/2606.03999</link>
  <pubDate>Thu, 04 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.03999v1 Announce Type: new Abstract: This letter investigates Airy beam dispersion in near-field wideband terahertz communications. Unlike conventional focusing beams, whose dispersion mainly appears as focal-point migration, Airy beams exhibit frequency-dependent shifts of both the reference focusing point and the self-bending main-lobe trajectory. Based on the Fresnel diffraction integral, a closed-form trajectory expression is derived to characterize the dispersion behavior across subcarriers. Furthermore, a true-time-delay (TTD)-assisted Airy beamforming structure is developed to actively control the trajectory dispersion. By properly designing the time delay parameters, the proposed scheme can either generate frequency-dependent curved trajectory clusters for sensing-oriented scanning or suppress trajectory drift for reliable communication.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Plug-and-Play Diffusion Meets ADMM: Dual-Variable Coupling for Robust Medical Image Reconstruction</title>
  <link>https://arxiv.org/abs/2602.23214</link>
  <pubDate>Thu, 04 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2602.23214v2 Announce Type: replace-cross Abstract: Plug-and-Play diffusion prior (PnPDP) frameworks have emerged as a powerful paradigm for solving imaging inverse problems by treating pretrained generative models as modular priors. However, we identify a critical flaw in prevailing PnP solvers (e.g., based on HQS or Proximal Gradient): they function as memoryless operators, updating estimates solely based on instantaneous gradients. This lack of historical tracking inevitably leads to non-vanishing steady-state bias, where the reconstruction fails to strictly satisfy physical measurements under heavy corruption. To resolve this, we propose Dual-Coupled PnP Diffusion (DC-PnPDP), which restores the classical dual variable to provide integral feedback, progressively enforcing agreement between data consistency and the prior. However, this rigorous geometric coupling introduces a secondary challenge: the accumulated dual residuals exhibit spectrally colored, structured artifacts that violate the Additive White Gaussian Noise (AWGN) assumption of diffusion priors, causing severe hallucinations. To bridge this gap, we introduce Spectral Homogenization (SH), a frequency-domain adaptation mechanism that modulates these structured residuals into statistically compliant pseudo-AWGN inputs. This effectively aligns the solver&#39;s rigorous optimization trajectory with the denoiser&#39;s valid statistical manifold. Extensive experiments on CT and MRI reconstruction demonstrate that our approach resolves the bias-hallucination trade-off, achieving state-of-the-art fidelity with significantly accelerated convergence. The code is available at https://github.com/duchenhe/DC-PnPDP</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>Platonic Transformers: A Solid Choice For Equivariance</title>
  <link>https://arxiv.org/abs/2510.03511</link>
  <pubDate>Thu, 04 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2510.03511v3 Announce Type: replace-cross Abstract: While widespread, Transformers lack inductive biases for geometric symmetries common in science and computer vision. Existing equivariant methods often sacrifice the efficiency and flexibility that make Transformers so effective through complex, computationally intensive designs. We introduce the Platonic Transformer to resolve this trade-off. By defining attention relative to reference frames from the Platonic solid symmetry groups, our method induces a principled weight-sharing scheme. This enables combined equivariance to continuous translations and Platonic symmetries, while preserving the exact architecture and computational cost of a standard Transformer. Furthermore, we show that this attention is formally equivalent to a dynamic group convolution, which reveals that the model learns adaptive geometric filters and enables a highly scalable, linear-time convolutional variant. Across diverse benchmarks in computer vision (CIFAR-10), 3D point clouds (ScanObjectNN), and molecular property prediction (QM9, OMol25), the Platonic Transformer achieves competitive performance by leveraging these geometric constraints at no additional cost.</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>An Open-Source Two-Stage Computer Vision Pipeline for Fine-Grained Vehicle Classification using Vision Transformers</title>
  <link>https://arxiv.org/abs/2606.05149</link>
  <pubDate>Thu, 04 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.05149v1 Announce Type: cross Abstract: Vehicle body type is a significant determinant of cyclist injury severity in overtaking crashes, yet automated tools for classifying vehicles into injury-risk-relevant categories from naturalistic roadway video do not exist in the open literature. Standard object detection benchmarks provide only coarse vehicle labels (car, truck, bus, motorcycle), while existing fine-grained recognition systems are trained on controlled imagery and lack evaluation for deployment robustness across recording sites. This paper presents an open-source two-stage computer vision pipeline combining a pre-trained RT-DETR detector for coarse vehicle localization with a fine-tuned Vision Transformer (ViT-Base/16) for six-category body-type classification: passenger car, SUV, pickup truck, minivan, large van, and commercial truck. A confidence-based abstention mechanism withholds Stage 2 predictions when softmax output falls below 0.60, producing unknown labels rather than silent misclassifications. Evaluated on 3,805 annotated overtaking events from a bicycle-lane corridor in Ann Arbor, Michigan (in-distribution), the pipeline achieved 0.94 accuracy with per-class F1 scores from 0.91 (minivan) to 0.97 (SUV). On an independent out-of-distribution evaluation of 311 events from an open cycling dataset without retraining, accuracy was 0.89. Three of four well-represented categories maintained F1 at or above 0.90 under domain shift. The largest degradation was observed for minivan (F1 = 0.72), driven by abstention rate rising from 2.4% to 25.0% rather than active misclassification, consistent with the mechanism propagating genuine model uncertainty. The full pipeline, including inference scripts, training code, evaluation utilities, and model weights, is released as open-source software to support reproducibility and reuse across roadside video archives and cycling safety research.</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>Prospective Dynamic 3D MRI Reconstruction via Latent-Space Motion Tracking from Single Measurement</title>
  <link>https://arxiv.org/abs/2606.04249</link>
  <pubDate>Thu, 04 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.04249v1 Announce Type: cross Abstract: Prospective reconstruction is crucial in many clinical applications such as MRI-guided radiotherapy, which demands accurate image reconstruction and fast motion estimation from currently acquired measurements. However, prospective reconstruction remains challenging due to ultra-sparse sampling and stringent latency requirements. In this work, we propose PDMR, a Prospective Dynamic 3D MRI Reconstruction framework with latent-space motion tracking. Our core idea is to learn an efficient and generalizable latent manifold of motion fields offline, enabling rapid online adaptation for prospective reconstruction. Specifically, we parameterize the deformation vector fields (DVFs) on a low-dimensional manifold, effectively reducing the search space for fast online adaptation, and employ a tri-plane representation to achieve geometry-aware and memory-efficient encoding of 3D motion. Experiments on both XCAT digital phantoms and in-house abdominal MRI datasets demonstrate that PDMR achieves high-fidelity and temporally consistent reconstruction across multiple prospective scenarios (Immediate and After-2min), outperforming state-of-the-art retrospective and online methods. Our results suggest a promising pathway toward ultra-fast, motion-aware prospective MRI reconstruction in clinical practice.</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>3D-GlioPREDICT: 3D Latent Diffusion for Post-Radiotherapy Brain MRI Prediction in Patients with Glioma</title>
  <link>https://arxiv.org/abs/2606.05113</link>
  <pubDate>Thu, 04 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.05113v1 Announce Type: new Abstract: Radiotherapy is a cornerstone of glioma treatment inducing complex structural changes in brain tissue that are difficult to anticipate. Predicting these changes from pretreatment data could improve understanding of treatment-related effects and support the development of image-based outcome prediction methods. Recent studies have shown that follow-up brain magnetic resonance imaging can be synthesized from baseline imaging and treatment information, but most existing approaches operate on single 2D slices and represent treatment as a global parameter, rather than a spatially dynamic variable. In this work, we address both limitations with a 3D latent diffusion framework that conditions image generation on the spatially resolved voxel-wise dose distribution, alongside a pretreatment image and follow-up time. To make volumetric synthesis computationally feasible, the model combines latent-space compression with ControlNet-based spatial conditioning. The method was trained and evaluated on a public dataset comprising 257 scans from 25 glioma patients. Prediction quality was assessed using mean squared error, peak signal-to-noise ratio, and structural similarity index. Anatomical consistency was further evaluated using Dice scores for cerebrospinal fluid, gray matter, and white matter segmentations, together with hippocampus volume prediction error and deformation analysis based on log Jacobian determinant maps. Compared with our previously proposed 2D approach, the 3D model achieved improved image similarity while maintaining good agreement with ground truth anatomy and deformation patterns. Overall, these results support the feasibility of 3D treatment-aware generative modeling for predicting post-radiotherapy brain MRI using only pretreatment information. Code is available at https://github.com/nordinbelkacemi/fu-pred-3d</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>KD-NVC: A Search-and-Distill Framework to Accelerate Neural Video Coding</title>
  <link>https://arxiv.org/abs/2606.04595</link>
  <pubDate>Thu, 04 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.04595v1 Announce Type: new Abstract: While neural video coding (NVC) has achieved remarkable rate-distortion performance, real-time decoding on edge devices has become an important demand but remains limited by high complexity. Knowledge distillation (KD) is widely used for model acceleration, yet its application to NVC faces critical challenges. Specifically, the heterogeneity of NVC sub-modules renders uniform architectural reduction suboptimal, necessitating a per-module design for better rate-distortion-speed trade-off. However, searching for diverse architectures via existing neural architecture search (NAS) algorithms is unaffordable due to the expensive training cost of neural video codecs. Moreover, after the lightweight architecture is determined, existing distillation methods overlook the feature-energy sparsity induced by the rate-constraint, which is essential for maintaining compression performance. To address these issues, we propose a two-stage distillation framework KD-NVC. In the first stage, we introduce an acceleration-efficiency-based neural architecture search (AE-NAS) algorithm. It explores the module-wise Pareto frontier to adaptively allocate the acceleration budget across heterogeneous modules. Also, it introduces the acceleration-efficiency metric to determine the final student architecture without practically training all architecture-level candidates. In the second stage, we design an energy-aware feature distillation (EFD) loss that aligns the spatially-aggregated feature-energy signatures between the teacher and student codecs, transferring the rate-induced sparsity patterns for better compression efficiency. Experimental results demonstrate that the proposed framework consistently outperforms existing codec-oriented distillation methods, and achieves 69 FPS decoding at 1080p on RTX 5060 while maintaining comparable RD performance to VTM-LDB.</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>Scaling Datasets for Multi-Sensor, Multi-Agent, and Multi-Domain Learning in Autonomous Systems</title>
  <link>https://arxiv.org/abs/2606.04444</link>
  <pubDate>Thu, 04 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.04444v1 Announce Type: new Abstract: Existing datasets cannot support large-scale learning in multi-agent, multi-sensor, or multi-domain autonomy, where diversity and coordination are essential. We present a modular dataset generation pipeline that creates terabyte-scale, ground-truth-labeled data for ground, aerial, and infrastructure-based systems using the AVstack framework and CARLA simulator. Supporting single- and multi-agent configurations with flexible sensor suites, the pipeline enables controllable experimentation across challenging conditions. Representative perception and fusion studies show how generated data can support application-specific training and collaborative autonomy.</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>L-TGVN: Leveraging Longitudinal Priors for Personalized Rapid MRI</title>
  <link>https://arxiv.org/abs/2606.04419</link>
  <pubDate>Thu, 04 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.04419v1 Announce Type: new Abstract: MRI provides excellent soft-tissue contrast without ionizing radiation, but long acquisition times increase patient discomfort while also raising exam costs and limiting scanner throughput. A common approach to reduce scan time is to acquire fewer measurements, which yields an ill-posed linear inverse problem; recovering diagnostic-quality images therefore requires incorporating prior knowledge beyond the measured data. In follow-up exams, the most recent prior scan of a patient can provide a highly informative subject-specific context, but practical use is complicated by temporal changes (including pathology progression), misalignment between scans, and protocol drift across acquisitions. In this work, we introduce L-TGVN, a Longitudinal Trust-Guided Variational Network that leverages prior scans as side information to reconstruct the current scan from heavily undersampled measurements. Crucially, L-TGVN constrains the influence of prior scans to be consistent with the acquired measurements. Unlike many existing longitudinal reconstruction methods, it does not require explicit pre-registration between prior and current scans. It further accommodates differences in acquisition protocols across visits (e.g., changes in sequence parameters). We evaluate L-TGVN against matched-capacity baselines, including prior-guided methods and methods that do not use longitudinal priors, and observe consistent improvements in standard quantitative metrics together with better preservation of fine structures at challenging accelerations. Source code is available at github.com/sodicksonlab/L-TGVN.</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>FUSE-Flow: A Decoupled Framework for Calibration and Stateless Real-Time Multi-View Point Cloud Fusion</title>
  <link>https://arxiv.org/abs/2606.04376</link>
  <pubDate>Thu, 04 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.04376v1 Announce Type: new Abstract: Real-time multi-camera 3D reconstruction is a key foundation for immersive media, remote interaction and spatial computing. While synchronized camera arrays are widely adopted, achieving geometrically consistent and scalable real-time reconstruction remains challenging. A key challenge is the close linkage among extrinsic calibration, multi-view fusion and global optimization, which causes fluctuating reconstruction results, cumulative errors and poor system expandability. We propose a decoupled framework for calibration and stateless real-time multi-view point cloud fusion (FUSE-Flow), a framework with two collaborative components: geometry-aligned multi-view extrinsic calibration (GMAC) and reliability-guided multi-view point cloud fusion (FUSE). This split design avoids conflicting optimization objectives for targeted improvement. The GMAC module refines camera extrinsics via geometric constraints and multi-view reconstruction transformers, enabling accurate sparse-view calibration without calibration targets, dense images or global bundle adjustment. The FUSE module integrates confidence weighting and adaptive spatial hashing for stateless fusion, ensuring linear time and memory consumption. The two modules mutually reinforce each other: accurate camera poses boost fusion accuracy, and confidence-aware fusion corrects calibration biases. Validated on public datasets and real camera setups, FUSE-Flow outperforms mainstream real-time reconstruction methods in visual effect, dynamic stability and scalability, providing a practical solution for large-scale real-time 3D reconstruction.</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>Analysis-Driven Procedural Generation of an Engine Sound Dataset with Embedded Control Annotations</title>
  <link>https://arxiv.org/abs/2603.07584</link>
  <pubDate>Thu, 04 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2603.07584v2 Announce Type: replace-cross Abstract: Computational engine sound modeling is central to the automotive audio industry, particularly for active sound design applications and virtual prototyping. Emerging data-driven engine sound synthesis methods require large volumes of standardized, clean audio recordings with precisely time-aligned operating-state annotations: data that is difficult to obtain due to high costs, specialized measurement equipment requirements, and inevitable noise contamination. We present an analysis-driven framework for generating engine audio with sample-accurate control annotations. The method extracts harmonic structures from real recordings through pitch-adaptive spectral analysis, which then drive an extended parametric harmonic-plus-noise synthesizer. With this framework, we augment 5-10 min of source audio per engine 15-30x via diverse control trajectories and parametric variation, producing the Procedural Engine Sounds Dataset (19.0 h, 5,935 files): a set of engine audio signals with sample-accurate RPM and torque annotations spanning a wide range of operating conditions, signal complexities, and harmonic profiles. Comparison against real recordings validates that the synthesized data preserves characteristic harmonic structures, and a baseline differentiable synthesis network trained on the dataset confirms its suitability for data-driven engine sound modeling. The dataset is released publicly to support research on engine timbre analysis, control parameter estimation, and neural generative synthesis.</description>
  <dc:source>Systems/eess.AS_(Audio_and_Speech_Processing)</dc:source>
</item>
<item>
  <title>VGGSounder: Audio-Visual Evaluations for Foundation Models</title>
  <link>https://arxiv.org/abs/2508.08237</link>
  <pubDate>Thu, 04 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2508.08237v4 Announce Type: replace-cross Abstract: The emergence of audio-visual foundation models underscores the importance of reliably assessing their multi-modal understanding. The VGGSound dataset is commonly used as a benchmark for evaluating audio-visual classification. However, our analysis identifies several limitations of VGGSound, including incomplete labelling, partially overlapping classes, and misaligned modalities. These lead to distorted evaluations of auditory and visual capabilities. To address these limitations, we introduce VGGSounder, a comprehensively re-annotated, multi-label test set that extends VGGSound and is specifically designed to evaluate audio-visual foundation models. VGGSounder features detailed modality annotations, enabling precise analyses of modality-specific performance. Furthermore, we reveal model limitations by analysing performance degradation when adding another input modality with our new modality confusion metric.</description>
  <dc:source>Systems/eess.AS_(Audio_and_Speech_Processing)</dc:source>
</item>
<item>
  <title>SpeechEditBench: A Bilingual Multi-Attribute Benchmark for Instruction-Guided Speech Editing</title>
  <link>https://arxiv.org/abs/2606.01804</link>
  <pubDate>Thu, 04 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.01804v2 Announce Type: replace Abstract: Instruction-guided speech editing requires a model to modify specified speech attributes while preserving unrelated characteristics. Despite rapid progress in Speech Large Language Models (Speech LLMs), systematic evaluation of this capability remains challenging, as existing benchmarks are fragmented across isolated editing tasks. To bridge this gap, we introduce SpeechEditBench, a bilingual multi-attribute benchmark for instruction-guided speech editing. SpeechEditBench encompasses seven atomic editing tasks, as well as compositional editing tasks that integrate multiple operations within a single instruction. We propose an anchor-based evaluation protocol that separately assesses the edit success of target attributes and the preservation of untargeted attributes, leading to three metrics: target success, preservation success, and joint success. Using this benchmark, we evaluate mainstream Speech LLMs and specialized speech editing systems. The results reveal three key findings: (1) no single model performs well across all editing dimensions; (2) closed-source Speech LLMs generally outperform open-source models; (3) compositional editing remains highly challenging, with even the most advanced models struggling to achieve high joint success. SpeechEditBench provides a rigorous diagnostic framework to identify bottlenecks in Speech LLMs, thereby facilitating the development of next-generation Speech LLMs with more robust and precise instruction-guided editing capabilities. Data and code are available at https://github.com/daxintan-cuhk/SpeechEditBench .</description>
  <dc:source>Systems/eess.AS_(Audio_and_Speech_Processing)</dc:source>
</item>
<item>
  <title>Extracting accent features in spoken Brazilian Portuguese without sociolinguistic labels</title>
  <link>https://arxiv.org/abs/2605.30457</link>
  <pubDate>Thu, 04 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2605.30457v2 Announce Type: replace Abstract: Regional accent classification in Brazilian Portuguese (pt-BR) suffers from the need for reliable labeling. While large self-supervised learning (SSL) speech models are powerful, their training pipelines dilute sociophonetic information, since accent labels are generally not reliable or are not used in training objectives. This work introduces a novel workflow for feature extraction using only acoustic labels. By isolating explicit regional accent landmarks and using a phoneme-based forced aligner (ZIPA), our targeted feature set captures dialectal variance more effectively than utterance embeddings, demonstrating that localized features can outperform general-purpose architectures on accent-related tasks using minimal and objective data labels.</description>
  <dc:source>Systems/eess.AS_(Audio_and_Speech_Processing)</dc:source>
</item>
<item>
  <title>Neural Directional Filtering with Configurable Directivity Pattern at Inference</title>
  <link>https://arxiv.org/abs/2510.20253</link>
  <pubDate>Thu, 04 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2510.20253v2 Announce Type: replace Abstract: Spatial filtering with a desired directivity pattern is advantageous for many audio applications. In this work, we propose neural directional filtering with user-defined directivity patterns (UNDF), which enables spatial filtering based on directivity patterns that users can define during inference. To achieve this, we propose a DNN architecture that integrates feature-wise linear modulation (FiLM), allowing user-defined patterns to serve as conditioning inputs. Through analysis, we demonstrate that the FiLM-based architecture enables the UNDF to generalize to unseen user-defined patterns during inference with higher directivities, scaling variations, and different steering directions. Furthermore, we progressively refine training strategies to enhance pattern approximation and enable UNDF to approximate irregular shapes. Lastly, experimental comparisons show that UNDF outperforms conventional methods.</description>
  <dc:source>Systems/eess.AS_(Audio_and_Speech_Processing)</dc:source>
</item>
<item>
  <title>AUDDT: A Unified Benchmark Toolkit for Audio and Speech Deepfake Detectors</title>
  <link>https://arxiv.org/abs/2509.21597</link>
  <pubDate>Thu, 04 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2509.21597v2 Announce Type: replace Abstract: With the prevalence of artificial intelligence (AI)-generated content, such as audio deepfakes, a large body of recent work has focused on developing deepfake detection techniques. However, existing benchmarks employ a narrow set of datasets, leaving detector generalization to real-world conditions uncertain. In this paper, we systematically review 31 existing audio deepfake datasets and present an open-source benchmarking toolkit called AUDDT (https://github.com/MuSAELab/AUDDT). The goal of this toolkit is to automate the evaluation of pretrained detectors across a wide range of speech and non-speech audio datasets, giving users direct feedback on the advantages and shortcomings of their deepfake detectors under diverse manipulation types and recording conditions. We start by showcasing the usage of the developed toolkit, the composition of our benchmark, and the breakdown of different deepfake subgroups. Next, we highlight how AUDDT differs from existing benchmarking efforts by enabling large-scale, diverse evaluation across modern spoofing methods and richer attribute-level analysis through comprehensive metadata annotation. Using a widely adopted pretrained deepfake detector, we present in- and out-of-domain detection results, revealing notable performance variability across different conditions and audio manipulation types. Lastly, we also analyze the limitations of these existing datasets and their gaps relative to practical deployment scenarios.</description>
  <dc:source>Systems/eess.AS_(Audio_and_Speech_Processing)</dc:source>
</item>
<item>
  <title>A Study of the Scale Invariant Signal to Distortion Ratio in Speech Separation with Noisy References</title>
  <link>https://arxiv.org/abs/2508.14623</link>
  <pubDate>Thu, 04 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2508.14623v2 Announce Type: replace Abstract: This paper examines the implications of using the Scale-Invariant Signal-to-Distortion Ratio (SI-SDR) as both evaluation and training objective in supervised speech separation, when the training references contain noise, as is the case with the de facto benchmark WSJ0-2Mix. A derivation of the SI-SDR with noisy references reveals that noise limits the achievable SI-SDR, or leads to undesired noise in the separated outputs. To address this, a method is proposed to enhance references and augment the mixtures with WHAM!, aiming to train models that avoid learning noisy references. Two models trained on these enhanced datasets are evaluated with the non-intrusive NISQA.v2 metric. Results show reduced noise in separated speech but suggest that processing references may introduce artefacts, limiting overall quality gains. Negative correlation is found between SI-SDR and perceived noisiness across models on the WSJ0-2Mix and Libri2Mix test sets, underlining the conclusion from the derivation.</description>
  <dc:source>Systems/eess.AS_(Audio_and_Speech_Processing)</dc:source>
</item>
<item>
  <title>Audio Interaction Model</title>
  <link>https://arxiv.org/abs/2606.05121</link>
  <pubDate>Thu, 04 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.05121v1 Announce Type: cross Abstract: Audio is an inherently interactive modality, yet today&#39;s Large Audio Language Models (LALMs) are offline, and streaming audio models each handle only a single task such as streaming ASR or voice chatting. It is time to unify them into one online LALM: a model that, through an always-on perceive-decide-respond loop, listens to sound, environment, and instructions in real time and reacts on the fly. We formalize this regime as the Audio Interaction Model, and realize it with Audio-Interaction, a unified streaming model that retains offline task execution while adding online general audio instruction following, from dialogue to full voice chatting, deciding when to respond from the semantics of the stream. To enable this, we propose SoundFlow, a framework that instantiates the perceive-decide-respond loop end to end, from data to training to deployment, through streaming-native data construction, comprehension-aware training, and asynchronous low-latency inference for stable real-time interaction. We further construct StreamAudio-2M, a 2.6M-item streaming corpus spanning 7 fundamental abilities and 28 sub-tasks, and Proactive-Sound-Bench for evaluating proactive audio intervention. Across 8 benchmarks, Audio-Interaction preserves competitive performance on mainstream audio tasks while unlocking capabilities inaccessible to offline LALMs, including real-time ASR, streaming audio instruction following, and proactive help.</description>
  <dc:source>Systems/eess.AS_(Audio_and_Speech_Processing)</dc:source>
</item>
<item>
  <title>SURF: Separation via Unsupervised Remixing Flow</title>
  <link>https://arxiv.org/abs/2606.04921</link>
  <pubDate>Thu, 04 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.04921v1 Announce Type: cross Abstract: The goal of single-channel source separation is to reconstruct $K$ sources given their mixture. In supervised settings where vast amounts of clean source data are available, this challenging, ill-posed problem has been addressed successfully by generative diffusion and flow-based prior models. However, access to such clean source samples is often limited, and even when available, supervised models are vulnerable to domain shifts. To bridge this gap, we present Separation via Unsupervised Remixing Flow (SURF), an unsupervised flow matching approach for source separation that learns directly from observed mixtures. This method relies on a novel combination of state-of-the-art supervised flow matching and regression-based self-supervised techniques. At a high level, starting from a teacher model, we utilize a &quot;remixing&quot; step to bootstrap the learning of a student flow model from the teacher&#39;s estimates. We provide insights into the objectives optimized by this approach and draw a novel connection to the Wake-Sleep algorithm. Empirical evaluations on image and audio benchmarks demonstrate that SURF establishes a new state-of-the-art, significantly outperforming existing unsupervised methods. See our demo page for examples. https://google.github.io/df-conformer/surf/</description>
  <dc:source>Systems/eess.AS_(Audio_and_Speech_Processing)</dc:source>
</item>
<item>
  <title>Multilingual Long-Form Speech Instruction Following: KIT&#39;s Submission to IWSLT 2026</title>
  <link>https://arxiv.org/abs/2606.04730</link>
  <pubDate>Thu, 04 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.04730v1 Announce Type: cross Abstract: With the advent of Large Language Models, single-task and token-based multi-task models have evolved into instruction-based systems that infer task and target language implicitly from natural language prompts. This trend is reflected in IWSLT&#39;s Instruction Following Track, which this year introduced new tasks including an unknown surprise task, posing a genuine challenge against overfitting to known tasks. We present KIT&#39;s submission to the Long and Short Instruction Following tracks in the unconstrained setting. Our approach combines a general data augmentation pipeline that converts short-form corpora into long-form training data through segment concatenation, LLM-based label generation, and cross-lingual translation, yielding over 1M instances across six tasks and four languages. We further show that likelihood-based re-ranking, while highly effective for ASR, systematically degrades semantic tasks by spuriously selecting candidates generated from segmented audio processing rather than holistic long-form inference, a failure mode resolved by combining likelihood with Minimum Bayes Risk decoding.</description>
  <dc:source>Systems/eess.AS_(Audio_and_Speech_Processing)</dc:source>
</item>
<item>
  <title>Entity Binding Failures in Speech LLM Reasoning: Diagnosis and Chain-of-Thought Intervention</title>
  <link>https://arxiv.org/abs/2606.04474</link>
  <pubDate>Thu, 04 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.04474v1 Announce Type: cross Abstract: Speech Large Language Models (SLLMs) underperform their text counterparts on complex reasoning. We reveal that this modality gap is not a uniform cognitive deficit. Evaluating three diverse SLLMs, we show speech-to-text (S2T) matches or exceeds text-to-text (T2T) on spatial, syntactic, and factual tasks. However, on logical tasks requiring entity tracking, S2T accuracy collapses to chance. We diagnose this localized degradation as an entity binding failure: continuous speech features cause models to lose precise entity-property associations during implicit reasoning. To resolve this, we propose Entity-Aware Chain-of-Thought (EA-CoT), forcing SLLMs to explicitly enumerate entities and bind them to claims before reasoning. Strikingly, EA-CoT bridges the gap, even when spoken names are misrecognized, yielding up to a 24.4% absolute accuracy improvement. Ablations confirm these gains stem entirely from explicit semantic binding, reframing the gap as a resolvable bottleneck.</description>
  <dc:source>Systems/eess.AS_(Audio_and_Speech_Processing)</dc:source>
</item>
<item>
  <title>CleanCodec: Efficient and Robust Speech Tokenization via Perceptually Guided Encoding</title>
  <link>https://arxiv.org/abs/2606.04418</link>
  <pubDate>Thu, 04 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.04418v1 Announce Type: cross Abstract: Neural audio codecs are a key component of speech processing pipelines, compressing audio into discrete tokens for downstream modeling. However, existing codecs struggle to balance reconstruction quality with token efficiency, often encoding perceptually irrelevant information such as background noise and recording artifacts at the expense of linguistically and acoustically meaningful content. We reframe audio tokenization as a selective information bottleneck problem and propose CleanCodec, a denoising audio codec which learns to encode only perceptually important features and discard imperceptible information. At just 12.5 tokens per second, CleanCodec achieves state-of-the-art tokenization efficiency, substantially outperforming existing codecs in speaker similarity and speech intelligibility. Evaluations on downstream text-to-speech and voice conversion tasks further demonstrate improved performance and up to 17x faster inference, highlighting significant efficiency gains.</description>
  <dc:source>Systems/eess.AS_(Audio_and_Speech_Processing)</dc:source>
</item>
<item>
  <title>Gauss Circle Lattices with Geometric Convolutions for Synthesizing High Dimensional Image-Source Room Impulse Responses</title>
  <link>https://arxiv.org/abs/2606.04358</link>
  <pubDate>Thu, 04 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.04358v1 Announce Type: cross Abstract: The image-source model (ISM) is a widely adopted method for efficiently simulating acoustic room impulse responses (RIRs) under specular reflection assumptions. Acoustic paths between source and receiver are traced to lattice points computed from successive reflections over bounding planes of the room. Rectangular rooms bound the total number of image-sources to be polynomial in the RIR&#39;s duration or distance $k$ equivalent, with degree equal to the number of room dimensions $N$. Direct ISM simulations are therefore upper-bounded in compute by $O \left ( k^N \right )$, and consider only cases of $N \leq 3$ for tractability and real-world applications. This work proposes an alternative computational method that lowers the asymptotic compute bound to $O \left ( N k^2 \log k \right )$ for integer coordinates and room dimensions via reducing ISM lattice point counting to the classic Gauss circle problem (GCP). We extend the lattice counting model to frequency-dependent and reflection weighted image-sources in higher dimensions, relating solutions between successive dimensions via the convolution operator. Two constructions for realizing RIRs are presented, along with time-frequency controls, error and run-time analysis, and RIR statistics.</description>
  <dc:source>Systems/eess.AS_(Audio_and_Speech_Processing)</dc:source>
</item>
<item>
  <title>Feasibility of Time-Domain DNN-Based Speech Enhancement on Embedded FPGA for Hearing Aid</title>
  <link>https://arxiv.org/abs/2606.04221</link>
  <pubDate>Thu, 04 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.04221v1 Announce Type: cross Abstract: Hearing aids impose strict latency and power constraints that current DNN-based speech enhancement systems struggle to meet on embedded hardware. We characterize this gap by deploying both speech separation and denoising using the lightweight SuDoRM-RF++ architecture on the AMD-Xilinx Kria KV260, evaluated at FP32 and 16-bit fixed-point precision for each task. Across these configurations, first-sample latency tracks with on-chip parameter caching rather than arithmetic throughput, identifying data movement as the primary bottleneck. Precision reduction halves the model memory footprint without compromising objective speech quality. The fixed-point denoising accelerator achieves a first-sample latency of 9.7 ms, meeting the 10 ms clinical threshold, while speech separation reaches 16.0 ms. These measurements establish concrete resource requirements for embedded DNN-based speech enhancement and quantify the remaining gap to hearing aid deployment.</description>
  <dc:source>Systems/eess.AS_(Audio_and_Speech_Processing)</dc:source>
</item>
<item>
  <title>The Differentiable Auditory Loop (DAL): An ML Framework for Hyper-Personalized Hearing Aids</title>
  <link>https://arxiv.org/abs/2606.04103</link>
  <pubDate>Thu, 04 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.04103v1 Announce Type: cross Abstract: Conventional hearing aids rely on fixed, frequency-dependent amplification and compression to manage reduced sensitivity, which often fails to provide sufficient listening support in complex environments, such as situations with multiple speakers (the &quot;cocktail party&quot; problem). To more comprehensively address the underlying encoding dysfunctions of hearing loss, we introduce the Differentiable Auditory Loop (DAL), a new open-source framework for personalized hearing aid design and fitting. Our first implementation of DAL incorporates CARFAC, a differentiable model of human cochlear function, which we ported to JAX, to optimize a deep neural network to match impaired auditory neural activity patterns with a normal-hearing reference. To build a hearing aid with the fine-grained spectro-temporal signal processing required, we adopt SEANet, a waveform-to-waveform fully convolutional UNet generator. We fine-tune the network by comparing the outputs of a CARFAC model fitted to normal hearing with that of a CARFAC model fitted to match each subject&#39;s individual hearing impairment. The comparison is done using loss functions derived from the respective CARFAC neural activity pattern (NAP) outputs and stabilized auditory images (SAIs), the latter providing a 2D representation that captures phase-insensitive temporal structure in the auditory nerve output. Through gradient descent, the SEANet model learns to both denoise the input and compensate for the hearing loss modelled by the impaired CARFAC model. Across neural-representation and signal-fidelity metrics, the DAL-optimized SEANet model outperformed the tested master hearing aid (MHA) baselines. The DAL framework provides a practical path toward model-based, machine-learning-driven personalization of hearing aid signal processing. Next steps include hardware deployment to enable real-world clinical testing.</description>
  <dc:source>Systems/eess.AS_(Audio_and_Speech_Processing)</dc:source>
</item>
<item>
  <title>Channel-Oriented Design for EEG-to-Music Reconstruction</title>
  <link>https://arxiv.org/abs/2606.04040</link>
  <pubDate>Thu, 04 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.04040v1 Announce Type: cross Abstract: Brain-computer interfaces aim to decode naturalistic stimuli from neural signals, yet most progress to date has focused on vision and language. In this article, we study a more challenging but far less explored setting, EEG-to-music reconstruction, where signals are weak, distributed, and highly susceptible to noise and channel variability. Our central finding is that early channel mixing destroys weak but discriminative EEG signals. To address this, we propose a channel-oriented design with three key components. Specifically, channel-wise tokenization treats each electrode as an explicit token to retain spatially localized neural evidence, channel-wise multi-view self-distillation enforces consistency across temporal crops and random channel subsets to learn robust and distributed representations, and channel-wise data augmentation introduces structured channel dropout to improve invariance to noise, artifacts, and missing electrodes. Together, these components preserve weak yet informative signals across channels and enable stable alignment to a semantic music representation space. We integrate this channel-oriented design within an encoding-alignment-decoding pipeline for EEG-to-music reconstruction. Theoretically, we characterize when preserving channel-level structure leads to improved alignment. Empirically, we compare with a range of state-of-the-art baselines and demonstrate consistent and significant performance gains.</description>
  <dc:source>Systems/eess.AS_(Audio_and_Speech_Processing)</dc:source>
</item>
<item>
  <title>Differentiable Articulatory Copy-Synthesis of Biphonic Singing</title>
  <link>https://arxiv.org/abs/2606.04943</link>
  <pubDate>Thu, 04 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.04943v1 Announce Type: new Abstract: Sygyt is a Tuvan style of biphonic singing in which a low vocal drone is sustained while a high harmonic is selectively amplified in the 1-3 kHz region. Copy-synthesizing this effect remains challenging for articulatory models, since it requires fine control of narrowly focused resonances that standard low-dimensional tract parameterizations cannot easily reproduce. We address this problem with a differentiable Kelly-Lochbaum waveguide augmented with a sublingual second source, cubic B-spline tract parameterization, and spatially varying learnable damping, optimized end-to-end by gradient descent from audio. On 20 segments from two independent sygyt datasets (5 singers, 10 pitches), the proposed model reduces log-spectral distance by 30-38% relative to an articulatory baseline, with the largest gains concentrated in the overtone region. Cepstral-envelope analysis further shows more accurate recovery of the merged formant structure characteristic of sygyt production. The model also outperforms a DDSP harmonic-plus-noise baseline with direct per-harmonic spectral control, suggesting that explicit acoustic structure is a useful inductive bias for overtone-singing copy-synthesis.</description>
  <dc:source>Systems/eess.AS_(Audio_and_Speech_Processing)</dc:source>
</item>
<item>
  <title>UAT: Unified Audio-Text Diffusion for Audio Generation, Editing, and Captioning</title>
  <link>https://arxiv.org/abs/2606.04939</link>
  <pubDate>Thu, 04 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.04939v1 Announce Type: new Abstract: Audio generation and audio-to-text understanding remain largely separate, with diffusion models dominating high-fidelity synthesis and autoregressive (AR) language models driving captioning and semantic prediction. Existing unified approaches typically rely on either heterogeneous modules or AR-centric modeling, which can hinder joint optimization and limit acoustic fidelity. We present UAT, to our knowledge, the first diffusion-centric framework that supports unified audio generation, editing, and captioning. UAT couples continuous latent diffusion for audio with masked discrete diffusion for text, enabling bidirectional audio-text modeling within a shared dual-stream backbone. Experiments show that UAT preserves strong audio generation and editing capabilities while achieving competitive captioning performance, demonstrating a favorable balance between acoustic synthesis and semantic prediction. Demo samples are available at https://UAT-demo.github.io.</description>
  <dc:source>Systems/eess.AS_(Audio_and_Speech_Processing)</dc:source>
</item>
<item>
  <title>Read What You Hear: Reference-Free Hypotheses Evaluation with Acoustic Discrepancy</title>
  <link>https://arxiv.org/abs/2606.04680</link>
  <pubDate>Thu, 04 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.04680v1 Announce Type: new Abstract: Automatic speech recognition systems commonly rely on reference transcriptions for evaluation, while reference-free approaches often depend on internal confidence estimation or auxiliary language models. We propose READ (Reference-free Hypothesis Evaluation with Acoustic Discrepancy), a novel metric that evaluates ASR hypotheses directly from the speech signal. READ emphasizes the acoustic grounding of hypotheses. It uses a pretrained auto-regressive TTS model to compute the conditional likelihood of speech tokens given a text hypothesis, to measure fine-grained acoustic discrepancy between speech and text. Without additional training, READ can be applied for hypothesis refinement. Experiments show that READ correlates with specific recognition errors and improves ASR outputs, achieving up to 20% relative error rate reduction, with particularly strong gains under noisy conditions.</description>
  <dc:source>Systems/eess.AS_(Audio_and_Speech_Processing)</dc:source>
</item>
<item>
  <title>Masked Wavelet Scattering Transform Neural Field for Sound Field Reconstruction</title>
  <link>https://arxiv.org/abs/2606.04370</link>
  <pubDate>Thu, 04 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.04370v1 Announce Type: new Abstract: In this paper, we propose a reconstruction framework that leverages the Wavelet Scattering Transform (WST) as a multi-scale feature extractor to impose statistical priors under sparse observation conditions. The reconstruction problem is formulated as an optimization task and solved using a neural field, with the WST incorporated into the training loss function. As a proof of concept, we validate the proposed method on HRTF upsampling. A masking strategy is applied to the WST coefficients, resulting in a two-phase procedure. The first phase learns a binary mask from a small multi-subject dataset, while the second phase applies the learned mask to the WST coefficients of an individual HRTF to preserve informative statistical structures during reconstruction. Validation against baseline methods, which also serve as an ablation study of the different components of the framework, demonstrates the effectiveness of the proposed approach.</description>
  <dc:source>Systems/eess.AS_(Audio_and_Speech_Processing)</dc:source>
</item>
<item>
  <title>Sparse-View Lung Nodule Volumetry from Digitally Reconstructed Radiographs via AReT: Anatomy-Regularized TensoRF</title>
  <link>https://arxiv.org/abs/2606.02639</link>
  <pubDate>Wed, 03 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.02639v1 Announce Type: new Abstract: We identify and resolve a previously unreported failure mode in TensoRF when applied to X-ray attenuation fields: the default density shift of -10, originally introduced for RGB scene reconstruction, suppresses density gradients and prevents sparse-view medical reconstruction regardless of learning rate or regularization strategy. Setting the density shift to zero restores gradient flow and enables stable volumetric reconstruction of pulmonary nodules from only three orthogonal X-ray projections. Building on this, we propose AReT, an anatomy-regularized tensorial radiance field framework for lung nodule reconstruction using coronal, sagittal, and axial projections from the LIDC-IDRI dataset (19 patients, radiologist-annotated nodules). Unlike existing NeRF approaches requiring dense multi-view acquisition, AReT is designed for sparse-view thoracic imaging and incorporates chest-anatomy-aware regularization combining L1 sparsity and total variation smoothness. A systematic comparison across 11 reconstruction strategies shows anatomy-aware regularization consistently outperforms generative-prior-guided approaches. Evaluated against radiologist consensus segmentations, AReT achieves Pearson r=0.983 (p =10 mm (n=14), median absolute volumetric error of 11.4%, near-zero systematic bias of -77.3 mm^3, and 8.4x improvement over spherical volume approximation.</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>Learning to Refine: Spectral-Decoupled Iterative Refinement Framework for Precipitation Nowcasting</title>
  <link>https://arxiv.org/abs/2606.02661</link>
  <pubDate>Wed, 03 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.02661v1 Announce Type: new Abstract: Accurate precipitation nowcasting is vital for disaster mitigation, but deep learning methods face a key trade-off: regression models produce over-smoothed, spectrally decaying predictions that blur convective details and violate turbulence power laws; diffusion models generate realistic yet unanchored hallucinations lacking physical grounding. We propose Spectral-Decoupled Iterative Refinement (SDIR), a deterministic framework that reformulates nowcasting as progressive frequency-decoupled refinement. SDIR first extracts a stable low-frequency synoptic skeleton, then iteratively refines high-frequency textures under physical constraints, eliminating both blurring and hallucinations. It features a dual-path design: the Synoptic Frequency-Guided Former (SFG-Former) with Scale-Adaptive Transformers for global structure, and the Fourier Residual Refiner (FR-Refiner) with Scale-Conditioned Fourier Neural Operators for fine residuals. A Physically Consistent Power Spectral Density (PCPSD) loss with dynamic masking enforces a turbulence-consistent spectral distribution. Experiments on three benchmarks show SDIR significantly outperforms SOTA methods in spatial accuracy while achieving spectral fidelity competitive with diffusion-based methods, enabling reliable high-resolution operational nowcasting. Code link: https://github.com/RuntimeWarning/SDIR.</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>Depth from Dual Differential Defocus and Stereo Consensus</title>
  <link>https://arxiv.org/abs/2606.02906</link>
  <pubDate>Wed, 03 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.02906v1 Announce Type: new Abstract: We introduce D^3S Consensus, a physics-based, closed-form algorithm that unifies depth-from-defocus (DfD) and stereo to achieve highly accurate depth estimation throughout an extended working range beyond the depth-of-field (DoF) of cameras. Given a pair of dual-defocus stereo images, the method estimates an overdetermined set of depth using a novel DfD theory, Dual Differential Defocus (D^3), and (S)tereo in a coupled fashion. It then picks the most confident depth prediction from the set by enforcing consensus between these physically independent cues to reject unreliable estimates. Analysis shows that D^3S achieves a comparable working range under the same error tolerance with 10x smaller baseline than previous triangulation-based depth estimation systems. This enables compact passive binocular rangefinders with substantially smaller form factors than conventional stereo and DfD designs. We demonstrate the first D^3S prototype with only 4 mm baseline and 12 mm EFL. It generates up to 900 x 1800-pixel depth maps with 1-cm mean absolute error over 0.3-1.64 m from a snapshot acquisition. This has surpassed the reported accuracy of certain commercially available stereo cameras with much larger form factors.</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>AtlasGS: Brain MRI Spatial Resolution Harmonization With Shared Gaussian Geometry</title>
  <link>https://arxiv.org/abs/2606.02961</link>
  <pubDate>Wed, 03 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.02961v1 Announce Type: new Abstract: Splatting (GS)-based shared geometry framework adopts a two-stage training strategy, in which an explicit, subject-specific Gaussian scaffold encoding anatomical geometry is first learned from the isotropic structural scan and then reused to fit appearance for target modalities acquired with sparse slices. Experiments on the UK Biobank, GBM, and ABCD datasets for through-plane super-resolution across multiple modalities (T2-weighted, FLAIR, DWI, ASL), degradation factors ($\times 3$, $\times 5$, $\times 7$), and pathological abnormalities (glioblastoma) demonstrate state-of-the-art reconstruction fidelity. The shared Gaussian geometry enables arbitrary-view generation for target modalities with strong structural consistency and further shows potential for self-supervised in-plane super-resolution. This work establishes explicit geometry-guided representations as a novel, flexible, and interpretable pathway toward retrospective multi-contrast MRI harmonization and reliable clinical reference construction. Source code is available at: https://github.com/yfgao76/AtlasGS</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>SMAC: Spatial-Modal Joint Modeling and Adaptive Representation Collapse for Multimodal Object Tracking</title>
  <link>https://arxiv.org/abs/2606.03370</link>
  <pubDate>Wed, 03 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.03370v1 Announce Type: new Abstract: Multimodal multi-object tracking (MOT) under complex illumination remains challenging due to insufficient joint modeling of spatial and modal features and the limited adaptability of fixed fusion strategies. To address these issues, this paper proposes a spatial-modal convolution fusion and distillation-prompt-based multimodal MOT framework. A spatial-modal fusion backbone is first constructed, where a Basic module performs spatial feature extraction and modal interaction via decoupled 3D convolution, while a Mixed module models nonlinear cross-modal correlations through amplitude-phase decomposition. In addition, a representation collapse network is designed for adaptive multimodal fusion. A Distillation Prompt Guidance (DPG) module generates dynamic modal weights under teacher supervision, and a Global Modal Difference Aggregation (GMDA) module preserves discriminative information during multimodal representation collapse. Extensive experiments on the UniRTL dataset demonstrate the effectiveness of the proposed method. The proposed tracker achieves 63.31 HOTA and 79.21 MOTA on the RNT modality, outperforming several state-of-the-art methods while maintaining favorable inference efficiency. The source code and pretrained models are publicly available at https://github.com/QitaiSun/SMAC.</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>When BBR Meets Live Streaming</title>
  <link>https://arxiv.org/abs/2606.03468</link>
  <pubDate>Wed, 03 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.03468v1 Announce Type: new Abstract: Recently, industrial pioneers like Amazon, Tencent, ByteDance, and Huawei have been adopting BBR as their congestion control algorithm for live-streaming applications, including TikTok Live. However, BBR, originally crafted for bulk data transmission, faces multiple challenges in live-streaming scenarios. In this paper, we first explore two key issues associated with BBR due to inaccurate bandwidth estimation in live-streaming scenarios: (i) BBR cannot easily exit its startup phase, resulting in a fierce self-inflicted loss. (ii) BBR sends data at a lower rate than the available bandwidth during its stable phase. We then propose BBR-Copilot, an auxiliary congestion control component that cooperates with BBR, making BBR better adapt to live-streaming scenarios. BBR-Copilot allows for proactively generating accurate bandwidth measurement samples by smartly creating and sending extra data. We implement the BBR-Copilot prototype upon QUIC and evaluate it via testbed. Experimental evaluation results show that BBR-Copilot effectively enhances BBR&#39;s performance in live-streaming scenarios.</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>SEAOTTER: Sensor Embedded Autoencoding with One-Time Transcode for Efficient Reconstruction</title>
  <link>https://arxiv.org/abs/2606.03940</link>
  <pubDate>Wed, 03 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.03940v1 Announce Type: new Abstract: In robotics systems, vast amounts of visual data are easily captured at high resolution using low-cost, low-power hardware. Yet, limited bandwidth and on-device compute resources prevent full utilization when transmitted via conventional codecs like JPEG/MPEG. Newer codecs, like AV1/AVIF, improve the rate-distortion trade-off, but demand far more resources for encoding, impractical without custom ASICs. Recent asymmetric autoencoders deliver high quality under extreme power and bandwidth constraints, but add prohibitive decoding cost and use bespoke formats that ignore decades of infrastructure built around standards like JPEG. To address these limitations, we introduce a compression framework for cloud robotics based on a Sensor Embedded Autoencoder paired with a One-Time Transcode for Efficient Reconstruction (SEAOTTER). Because the sensor, cloud, and consumer stages face very different power and bandwidth budgets, SEAOTTER combines the compactness of a learned latent with the broad usability of a standard JPEG file. Since naive transcoding degrades performance, we propose a learnable JPEG color and quantization transform that enables increased accuracy for global, dense, and vision-language-based perception. Using SEAOTTER, we train both general-purpose and task-aware transcoding pipelines for a pre-trained, frozen encoder. At a compression ratio of 200:1 and compared to AVIF, we observe 7 times faster encoding, 3.5 times faster decoding, and +8% ImageNet top-1 accuracy, while retaining compatibility with JPEG infrastructure. Our code is available at https://github.com/UT-SysML/seaotter .</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>Cross-Modal Contrastive Learning of ECG and Angiography Representations for Severe Stenosis Classification</title>
  <link>https://arxiv.org/abs/2606.02605</link>
  <pubDate>Wed, 03 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.02605v1 Announce Type: cross Abstract: Coronary artery stenosis is a common cardiovascular disease, with severe, untreated cases posing significant risks of heart attack. Although coronary (X-ray) angiograms remain the standard for stenosis diagnosis, they are invasive, time- and resource-intensive, and therefore only performed on patients with a high probability of disease based on symptoms and prior clinical tests. However, a subset of patients, especially those without symptoms, may remain undiagnosed. Detecting indications of stenosis from ECGs, which are fast, cheap, non-invasive, and thus routinely acquired even in asymptomatic patients, would support early diagnosis. However, as no reliable stenosis-specific signal has been identified in ECGs, they cannot currently be used for stenosis risk stratification. To address this, we introduce StenCE, a pretraining framework, allowing stratification of patients based on features derived directly from ECGs. Evaluations across varying stenosis severity thresholds and additional ECG disease classification tasks demonstrate consistent performance improvements across different ECG encoders, outperforming previous work. The obtained models successfully detect signals for stenosis diagnosis in ECGs and are the first to achieve high performance in severe stenosis classification. The source code is available at https://github.com/NikolaCenic/ecg-stenosis-cls.</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>Hand Trajectory Fusion for Egocentric Natural Language Query Grounding</title>
  <link>https://arxiv.org/abs/2606.02962</link>
  <pubDate>Wed, 03 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.02962v1 Announce Type: cross Abstract: Egocentric Natural Language Query (NLQ) grounding asks a model to localize, in a long first-person video, the temporal interval that answers a free-form text query. Existing methods fuse video appearance with the query but ignore hand motion, despite the fact that roughly 41% of Ego4D NLQ queries are answered at a moment of hand-object manipulation or their immediate outcomes. We propose a hand-trajectory encoder for converting a sequence of hand skeletons into highly semantic hand kinematic features, which are then aligned and combined with pretrained video-text features through a cross-attention fusion strategy with adaptive gating. On the Ego4D NLQ v2 validation split, the clearest gains appear for Hand-Object Interaction queries (+2.54 R1@IoU=0.3) and Quantity/State queries (+4.32 R1@IoU=0.3), indicating that hand trajectory provides grounding cues beyond appearance alone.</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>Do Real-World Datasets Contain Natural Experiments? An Empirical Study Using Causal Feature Selection</title>
  <link>https://arxiv.org/abs/2606.03251</link>
  <pubDate>Wed, 03 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.03251v1 Announce Type: cross Abstract: In nature, events that affect some individuals or groups but not others constitute an implicit intervention and are known as natural experiments. For example, the COVID-19 pandemic was an intervention by the coronavirus on the sub-population infected with COVID. We ask, do natural experiments occur in existing real-world datasets? If yes, how should we treat them? To detect natural experiments in data, we use causal discovery to recover the underlying causal graph and perform feature selection based on causal links. If downstream performance improves by treating the data as interventional rather than observational, we argue that this suggests the dataset contains natural experiments. We first validate this hypothesis by simulating datasets with and without natural experiments using synthetic graphs. We then perform a systematic empirical evaluation on a large suite of real-world datasets. Our results indicate that real-world datasets do contain natural experiments and we can take advantage of those natural experiments to improve model performance using causal inference. Our work represents the initial foray into this area, offering a preliminary exploration within a limited scope.</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>Steered Mixture of Experts Regression for Image Denoising with Multi-Model-Inference</title>
  <link>https://arxiv.org/abs/2303.17409</link>
  <pubDate>Wed, 03 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2303.17409v2 Announce Type: replace Abstract: In this paper we introduce a novel block-based regression strategy for image denoising based on edge-aware Steered-Mixture-of-Experts (SMoE) models. SMoEs provide very sparse image representations, able to model sharp edges as well as smooth transitions in images efficiently with few parameters. A multi-model inference strategy is developed that significantly improves the denoising capacity of single SMoE models. We show that the important edge reconstruction properties of SMoEs are well preserved, even when many models are fused under severe noise. We investigate model-inference from local neighborhood blocks as well as from distant blocks using block-matching as in BM3D. Our initial results indicate that SMoE multi-model regression can provide promising results compared to state-of-the-art BM3D with excellent edge quality.</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>Towards Blind Lens Aberration Correction via Large LensLib Pre-training and Discrete Degradation Priors</title>
  <link>https://arxiv.org/abs/2511.17126</link>
  <pubDate>Wed, 03 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2511.17126v4 Announce Type: replace Abstract: The emerging deep-learning-based lens library pre-training (LensLib-PT) pipeline offers a new avenue for blind lens aberration correction by training a universal neural network, demonstrating strong capability in handling diverse unknown optical degradations. This work proposes FoundCAC, a universal foundational framework that resolves two challenges hindering the generalization of existing pipelines: the difficulty of scaling training data and the absence of prior guidance characterizing optical degradation. To improve data scalability, we expand the design specifications to increase degradation diversity and construct AODLibpro, a large-scale, unbiased lens library based on a uniform sampling strategy that quantifies spatial-variation patterns and severity. In terms of model design, to leverage Point Spread Functions (PSFs) as guidance while maintaining the blind paradigm, we propose a multi-stage vector-quantized representation learning scheme. This paradigm is specifically designed to construct a Latent PSF Representation (LPR), explicitly encoding complex continuous PSFs into a discrete degradation prior to regularize the highly ill-posed restoration process. Through a simple yet effective codebook-freezing strategy, our framework leverages the discrete prior to elevate full-shot restoration performance and unlock highly efficient few-shot adaptation for unseen lenses. Experiments on diverse aberrations of synthetic LensLib and real-world lenses demonstrate that our framework achieves state-of-the-art zero-shot generalization while enabling highly efficient few-shot adaptation for specific lenses. The source code and datasets will be made publicly available at https://github.com/zju-jiangqi/FoundCAC.</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>Uncertainty-Calibrated Explainable Artificial Intelligence for Fetal Ultrasound Plane Classification: A Systematic Review</title>
  <link>https://arxiv.org/abs/2601.00990</link>
  <pubDate>Wed, 03 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2601.00990v3 Announce Type: replace Abstract: Fetal ultrasound is the cornerstone of antenatal care, and accurate recognition of a small set of standard anatomical planes underpins biometry, growth surveillance, and detection of structural anomalies. Deep learning classifiers now match or exceed expert accuracy on curated benchmarks, but most remain opaque and miscalibrated, leaving clinicians without the calibrated confidence or faithful explanations needed for safe decision support. We systematically reviewed 78 studies published between January 1, 2015 and April 30, 2026 that paired automated fetal plane classification with explainability or predictive uncertainty quantification, following PRISMA 2020. Pooled balanced accuracy across six standard planes was 0.93 (95% CI 0.91 to 0.95), but only 19 studies (24%) reported calibration and 14 (18%) reported selective prediction. We propose CALIB-XFUS, a 22-item reporting framework that operationalises calibration, explanation faithfulness, and fairness for regulated fetal ultrasound artificial intelligence. The framework spans six domains: clinical task and indication for use; dataset provenance and representativeness; model and training pipeline; calibration and selective prediction; explanation faithfulness and clinician validation; and post-market surveillance. We argue that uncertainty-calibrated, faithfully explained, and fairness-audited fetal ultrasound AI is now both technically feasible and regulatorily expected under the FDA Good Machine Learning Practice principles and the EU AI Act high-risk obligations.</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>NL-MambaXCT: Self-Supervised Nested-Learning Mamba for Nomex Honeycomb X-ray CT Defect Classification</title>
  <link>https://arxiv.org/abs/2605.27454</link>
  <pubDate>Wed, 03 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2605.27454v2 Announce Type: replace Abstract: X-ray computed tomography (XCT) is widely used for non-destructive testing of Nomex honeycomb structures in aerospace manufacturing, but industrial inspection still relies heavily on manual interpretation and supervised models trained on limited labeled data. This work introduces NL-MambaXCT, a Mamba-based framework that combines self-supervised masked image modelling with a Nested Learning (NL) formulation for automated, label-efficient defect classification from production XCT slices. The backbone is a four-stage 2D encoder with RegNet convolutional blocks in the early stages and Mamba-based sequence mixing with attention in the deeper stages. It is pretrained by masked image modelling on 19,961 unlabeled industrial XCT slices and fine-tuned on 2,000 relabeled Nomex XCT slices split by production order. NL is instantiated through two-timescale parameter dynamics: selected projections maintain slow exponential-moving-average traces alongside fast weights, while a deep-momentum optimizer introduces an additional slow parameter-update trajectory. On the held-out test set, the MIM-pretrained NL-MambaXCT model achieves 96.91% accuracy and 96.8% macro F1, outperforming CNN, attention, and single-timescale Mamba baselines by 3.11 to 10.31 percentage points in accuracy. The results suggest that combining masked self-supervision with NL-style fast/slow learning dynamics is a promising strategy for robust defect classification in Nomex honeycomb XCT inspection.</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>Short-Acquisition Contrast-Free Super-Resolution Microvascular Imaging in Rabbit Kidney</title>
  <link>https://arxiv.org/abs/2606.02782</link>
  <pubDate>Wed, 03 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.02782v1 Announce Type: new Abstract: Ultrasound localization microscopy (ULM) enables micrometer-scale microvascular imaging by localizing and tracking intravascular microbubbles, but its dependence on exogenous contrast agents and long acquisition times limits clinical translation. This study presents a high-frame-rate contrast-free super-resolution ultrasound microvascular imaging method based on high-frequency ultrafast ultrasound and nonlinear beamforming of backscatter signals from native blood flow. Using only 125 milliseconds of in vivo ultrafast data per image, the proposed method achieved an imaging frame rate of 8 frames/s in a rabbit kidney model. The reconstructed microvascular images resolved vessels with a global spatial resolution of 22.2 µm over a field of view of 23.04 x 15.18 mm², where the wavelength of ultrasound was 67.5 µm. This corresponds to a three-fold improvement over conventional power Doppler imaging under the same acquisition duration. Compared with conventional flow imaging, the proposed method provided improved microvascular contrast and finer vessel delineation without microbubble injection. These results demonstrate a practical pathway toward high-frame-rate, contrast-free super-resolution ultrasound imaging for microvascular assessment.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Global Unknown Estimation: A Statistical Framework for Wireless Distributed Learning</title>
  <link>https://arxiv.org/abs/2606.02891</link>
  <pubDate>Wed, 03 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.02891v1 Announce Type: new Abstract: Over-the-air computation (AirComp) is widely used for model aggregation in wireless distributed learning. Although it enhances communication efficiency, we believe that AirComp aggregation has limited effectiveness due to the difference between its target problem and that of distributed learning. In this paper, we develop a rigorous formulation for optimal model aggregation in wireless distributed learning. Using this formulation, we show that AirComp aggregation generally assumes a mismatched statistical model for local parameters. We then propose a statistical framework for model aggregation, called global unknown estimation (GUE). It captures the statistical relation between the local and global model parameters, allowing model aggregation to be interpreted as an inference task. We validate the efficiency of GUE through numerical experiments. Our results show that, in the low SNR regime, GUE can reduce the required power for model aggregation by approximately 15 dB compared to AirComp aggregation. Remarkably, this gain is obtained without additional computational overhead.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Fault-Aware Design for Reconfigurable Holographic Surface-Aided ISAC Systems</title>
  <link>https://arxiv.org/abs/2606.03013</link>
  <pubDate>Wed, 03 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.03013v1 Announce Type: new Abstract: Reconfigurable holographic surface (RHS)-aided integrated sensing and communication (ISAC) systems hold great promise for achieving both sensing and communication with low hardware costs and high energy efficiency. However, existing works largely overlook practical hardware impairments in RHSs, particularly faulty RHS elements with uncontrollable amplitudes, which degrade system performance if left unaddressed. This work aims to fill the gap by i) quantifying the impact of faulty RHS elements on ISAC performance and ii) optimizing the functional RHS elements to preserve the ISAC performance. Specifically, we derive the misspecified Cramer-Rao bound (MCRB) for sensing and the signal-to-interference-and-noise ratio (SINR) for communication to measure the performance loss caused by faulty elements. We then formulate an optimization problem that minimizes MCRB, subject to constraints on SINR, transmit power budget, and RHS amplitude. The high non-convexity of the formulated problem poses a significant challenge, which we address by reformulating and proposing a block coordinate descent-based solution incorporating majorization-minimization and successive convex approximation techniques. Simulation results verify that the proposed approach achieves an average 13.7% performance gain compared to the fault-unaware benchmark.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Node-Oriented Proactive Spectral Modulation: A Unified Fractional Framework for Graph Signal Denoising</title>
  <link>https://arxiv.org/abs/2606.03337</link>
  <pubDate>Wed, 03 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.03337v1 Announce Type: new Abstract: Graph signal denoising is a fundamental task in graph signal processing. While the node-oriented filtering approach enhances spatial adaptability, it suffers from spectral rigidity due to its reliance on the graph Fourier transform. Conversely, emerging fractional-domain transforms provide crucial spectral flexibility but are fundamentally limited by their globally shared filtering paradigm, failing to accommodate localized topological variations. To bridge this gap, this paper proposes a generalized node-oriented fractional filtering (NOFF) framework that seamlessly integrates localized spatial adaptability with proactive spectral modulation across various fractional transforms. However, straightforwardly assigning independent full-rank filters to all vertices incurs a prohibitive parameter space, leading to severe overfitting on random noise. To mitigate this, we introduce the low-rank NOFF (LRNOFF) architecture. By imposing a strict low-rank constraint, LRNOFF inherently acts as a powerful implicit regularizer, preventing noise memorization and ensuring the extraction of robust spectral bases. Furthermore, we develop an efficient computational implementation termed LRNOFF-Fast, which drastically reduces computational and memory overhead while preserving theoretical optimality. Experiments on real-world datasets demonstrate that the proposed framework achieves state-of-the-art performance.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Instantaneous Risk Minimization for Secure Integrated Sensing and Communication</title>
  <link>https://arxiv.org/abs/2606.03372</link>
  <pubDate>Wed, 03 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.03372v1 Announce Type: new Abstract: To ensure worst-case physical layer security, this paper proposes a robust beamforming framework for secure integrated sensing and communication (ISAC) systems. Different from conventional designs that focus on maximizing the ergodic secrecy rate, the proposed method aims to minimize instantaneous information leakage risk. We formulate a multi-objective optimization problem that jointly suppresses the worst-case eavesdropper signal-to-interference-plus-noise ratio (SINR), improves sensing accuracy, and ensures the quality of service (QoS) for legitimate users. To address the resulting non-convex problem, we develop a hierarchical iterative algorithm, in which the outer loop refines the continuous uncertainty regions based on the updated sensing performance, and the inner loop optimizes beamforming under the refined uncertainty regions. Theoretical analysis and simulation results demonstrate that the proposed method achieves per-transmission security guarantees with practical complexity.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Voxel-CKM: Voxelized Radio Frequency Radiance Fields for Fast and Few-Shot CKM Construction</title>
  <link>https://arxiv.org/abs/2606.03531</link>
  <pubDate>Wed, 03 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.03531v1 Announce Type: new Abstract: Channel knowledge maps (CKMs) are designed to predict channel state information (CSI) from user locations, thereby enabling low-overhead CSI acquisition. However, existing CKM construction methods often require hours-to-days of training time and dense measurements, resulting in substantial deployment cost. In this paper, we propose Voxel-CKM, a novel voxelized radio frequency (RF) radiance field framework for fast and few-shot CKM construction. The core idea is to replace implicit neural representations with explicit voxel grids to efficiently capture the spatial variation of wireless channels. Building upon this, we further introduce a compact vector-matrix (VM) decomposition to parameterize these voxel grids using a small set of matrices and vectors, which significantly accelerates convergence and facilitates fast CKM construction. To enable few-shot learning, we incorporate a transmitter prior as an inductive bias to guide the learning process under sparse measurements. Additionally, a total-variation (TV) regularization loss is proposed to mitigate overfitting and stabilize optimization. Experiments show that Voxel-CKM substantially accelerates training convergence and improves performance in the few-shot regime.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Chasing Lightning: Detecting, Characterizing, and Identifying a Powerful Space-Based GNSS Interference Source</title>
  <link>https://arxiv.org/abs/2606.03673</link>
  <pubDate>Wed, 03 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.03673v1 Announce Type: new Abstract: This paper analyzes and identifies a space-based Global Navigation Satellite System (GNSS) interference source that has caused scores of powerful transient wide-area interference events over continental Europe, Greenland, and Canada since 2019. While terrestrial or near-terrestrial sources are primarily responsible for the recent uptick in GNSS interference worldwide, space-based interferers are of special concern given their potential for vast geographic reach and their portent of a qualitative escalation in GNSS interference. Based on data collected between 2019 and 2026 from a network of terrestrial GNSS reference stations, this paper (1) develops a received-power-based detection framework; (2) details the spatial, temporal, and spectral patterns of wide-area interference events caused by the source; (3) presents and analyzes identification techniques that blend received-power and time-difference-of-arrival measurements; and (4) applies these techniques to confidently identify the GNSS interference source as a constellation of Russian early warning satellites in Molniya (&quot;lightning&quot;) orbits.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Constrained Pinching Antenna Array Design for Sum-Rate Maximization in Multi-User PASS</title>
  <link>https://arxiv.org/abs/2606.03830</link>
  <pubDate>Wed, 03 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.03830v1 Announce Type: new Abstract: Pinching antenna systems (PASS) have recently emerged as a promising architecture for flexible indoor wireless communications. However, most existing pinching antenna (PA) array designs for multi-user PASS either offer limited beam adaptation accuracy or require prohibitively high deployment cost. In this paper, we investigate a more practical constrained pinching antenna array (C-PAA)-assisted downlink PASS, where multiple PAs are grouped into a movable array and can be finely adjusted within the array at the wavelength scale. To improve the system spectral efficiency, a sum-rate maximization problem is formulated by jointly considering the array-center position and the fine-grained antenna distribution within the C-PAA. First, the structural properties of the C-PAA are characterized, and an explicit upper bound on the array aperture is derived. Then, tractable approximations for the effective channel gain and the achievable user rate are developed. Furthermore, the optimization problem of the multi-user sum-rate is analyzed, where the system sum-rate function is shown to exhibit a favorable unimodal behavior under practically relevant conditions, which enables an efficient one-dimensional search for the optimal C-PAA position. To further reduce the computational complexity, a closed-form approximate solution for the near-optimal array-center position is derived. Numerical results verify the accuracy of the developed analysis and demonstrate that the proposed C-PAA scheme closely approaches the ideal upper bound and significantly outperforms conventional fixed-spacing and existing PA array benchmarks.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>SNF-PRP: A Covert Integrating Sensing and Communications Framework</title>
  <link>https://arxiv.org/abs/2606.03960</link>
  <pubDate>Wed, 03 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.03960v1 Announce Type: new Abstract: Integrated sensing and communication (ISAC) enables simultaneous sensing and data transmission but exposes a critical vulnerability: probing signals may be intercepted, revealing both the transmitted information and the act of sensing itself. Existing physical layer security approaches mitigate interception yet operate with detectable signals, leaving sensing activity observable to a passive warden. This paper introduces sub-noise-floor pseudo-random probing (SNF-PRP), a covert sensing framework for OFDM-based ISAC systems under an energy-detection adversary model. SNF-PRP establishes an $\epsilon$-covertness guarantee via Kullback-Leibler (KL) divergence, exploits an $N_{\mathrm{sc}}$-fold spreading gain absent from prior wideband analyses, and derives in closed form the minimum integration length required to achieve a target Cramér-Rao bound (CRB). Simulations under 5G NR n78 numerology confirm sub-0.5 m range and sub-0.5 m/s velocity accuracy with KL divergence $5.8\times$ below the covertness threshold, validating joint feasibility at $-12$ dB and $-15$ dB probing powers.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>ERP-XTTN: Interpretable Prototype-Guided Cross-Attention for Cross-Subject ERP Classification</title>
  <link>https://arxiv.org/abs/2606.02939</link>
  <pubDate>Wed, 03 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.02939v1 Announce Type: cross Abstract: Interpretable brain-computer interface classifiers that generalize across subjects without calibration remain an open challenge. We test whether prototype-based cross-attention can provide competitive, interpretable event-related potential (ERP) classification under deployment-compatible conditions. We propose ERP-XTTN, a cross-attention architecture that routes input EEG patches to fixed difference-wave prototypes via query-key-only cross-attention with no value projection, so classification depends entirely on attention routing and attention faithfulness is structural rather than post-hoc. Prototypes are derived automatically from extrema in the training-fold difference wave. We evaluate across three public sources (BNCI Horizon 2020, HRI Cursor, and ERP CORE) spanning eight ERP components (ERN, LRP, ErrP, N170, P300, N2pc, MMN, N400), using leave-one-subject-out (LOSO) evaluation with causal filtering at two channel counts (3-channel and full montage), against EEGNet and xDAWN with Riemannian geometry (xDAWN+RG). The mean gap between the best baseline and ERP-XTTN was .018 AUROC at 3 channels and .034 at full montage, arising from two largely distinct sources: a temporal-flexibility cost relative to EEGNet and a spatial-exploitation cost relative to xDAWN+RG, the latter driven by signal-to-noise ratio at full montage. Beyond accuracy, the transparent routing reveals cross-subject signal structure that black-box models cannot: false positives resembled true positives more than true negatives did, indicating that classification errors are neurophysiologically explicable. ERP-XTTN generalizes across diverse ERPs under causal, calibration-free conditions with a small interpretability cost at minimal montages. To our knowledge, this is the first epoch-level LOSO benchmark on ERP CORE.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Hybrid Free-space-optics and Millimetre-wave D-band Transmitter enabled by Optically Harmonically Locked Lasers</title>
  <link>https://arxiv.org/abs/2606.03734</link>
  <pubDate>Wed, 03 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.03734v1 Announce Type: cross Abstract: We demonstrate a hybrid free-space optics (FSO) and D-band (110-170 GHz) millimetre-wave transmitter enabled by a single phase-locked laser pair, which simultaneously provides ultra-low RF phase noise and narrow optical linewidth for communications. Based on this, we further study the combined capacity under beam angle misalignment using &gt;100 Gb/s signalling.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Stable Hybrid Cross-Attention Fusion for Audio-Visual Event Recognition</title>
  <link>https://arxiv.org/abs/2606.03747</link>
  <pubDate>Wed, 03 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.03747v1 Announce Type: cross Abstract: Audio-Visual Event Recognition (AVER) is essential for intelligent urban monitoring systems, where robust multimodal understanding of complex environments is required. This paper proposes a stable hybrid cross-attention fusion framework for audio-visual event recognition in smart urban environments. The proposed architecture combines pretrained Video Masked Autoencoder (VideoMAE) and Audio Spectrogram Transformer (AST) representations with FiLM-based audio conditioning, bidirectional cross-attention fusion, multimodal Transformer encoding, and modality-temporal attention. To improve computational efficiency and training stability, frozen pretrained backbones and cached feature extraction are employed. Extensive experiments on the AVE dataset show that the proposed framework achieves the highest average performance among the evaluated unimodal and multimodal baselines across multiple evaluation metrics, obtaining a best validation accuracy of 91.74% and a test accuracy of 83.85 ± 1.40% over five independent runs. The results indicate that the proposed hybrid fusion strategy effectively captures complementary audio-visual information and provides robust multimodal representation learning for challenging real-world urban monitoring scenarios.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Limit Analysis of Graph Neural Networks with Wireless Conflict Graphs</title>
  <link>https://arxiv.org/abs/2606.03794</link>
  <pubDate>Wed, 03 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.03794v1 Announce Type: cross Abstract: Graph Neural Networks (GNNs) have emerged as a powerful tool for wireless resource allocation that leverages the underlying graph structure of communication networks. Their transferability property enables models trained on small-scale graphs to generalize to large-scale deployments with little performance deterioration, a desirable property for currently growing networks. Wireless networks operate in sparse regimes, where each node is connected to only a small number of other users. This work establishes theoretical results for transferability of GNNs over graphs derived from sparse Random Geometric Graphs (RGGs). In particular, we focus on conflict graphs of RGGs used to model interference among links. Our approach considers the closeness between RGGs and Deterministic Grid Graphs (DGGs) to establish bounds on the performance loss when a model is transferred across scales. We validate our theoretical findings through the problem of link scheduling, demonstrating that our learned policies consistently outperform existing benchmarks at scale. Finally, we examine the impact of our theoretical assumptions on empirical performance.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Sparse Activation for Sustainable Cell-Free Massive MIMO Networks: Less is More</title>
  <link>https://arxiv.org/abs/2606.03912</link>
  <pubDate>Wed, 03 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.03912v1 Announce Type: cross Abstract: Motivated by the vision of making sixth-generation (6G) networks sustainable, we study the sparse antenna/array activation problems in uplink cell-free massive multiple-input multiple-output (CF mMIMO) networks. We first develop an antenna-level optimal bilinear equalizer (OBE) weighting framework, in which each access point-user equipment (AP-UE) pair is assigned a matrix-valued long-term weight to shape the contribution of individual antenna elements, thereby generalizing the conventional large-scale fading decoding (LSFD) strategy from scalar coefficients to antenna-element-aware weighting. Building on this structure, we formulate sparse antenna activation as structured sparsity-inducing mean square error (MSE) minimization problems, and design four activation schemes at two granularities: antenna-level and array-level, each with UE-specific and network-wide (all-UEs) variants. The resulting convex problems are solved efficiently via the proximal method with closed-form group-wise updates, while the network-wide schemes are modeled through hierarchical sparsity and handled by a tree-structured proximal operator. Numerical results under correlated Rician channels and a detailed power consumption model demonstrate that the OBE weighting scheme consistently improves spectral efficiency over the LSFD, with gains increasing with the number of antennas. Meanwhile, the studied sparse activation schemes can achieve substantial energy efficiency improvement and power reduction with controllable spectral efficiency loss.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Least-Squares Adaptive Filter-Based Cohen&#39;s Class Time-Frequency Distribution for Signal Denoising</title>
  <link>https://arxiv.org/abs/2408.04210</link>
  <pubDate>Wed, 03 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2408.04210v4 Announce Type: replace Abstract: Inspired by the use of adaptive kernel-based Cohen&#39;s class time-frequency distributions (CCTFDs) for cross-term suppression, this paper aims to explore novel adaptive kernel functions for denoising, with a particular focus on non-stationary signal processing in practical applications. We integrate the Wiener filter principle and the time-frequency filtering mechanism of CCTFD to design the least-squares adaptive filter method in the Wigner-Ville distribution (WVD) domain, giving rise to the least-squares adaptive filter-based CCTFD, whose kernel function adjusts automatically to the input signal to achieve minimum mean-square error denoising in the WVD domain. Numerical experiments on typical simulated radar signals and real-world electrocardiogram data comprehensively demonstrate that the proposed adaptive CCTFD outperforms several state-of-the-art methods in noise suppression.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Power-Aware Cognitive Radar Multi-target Tracking Under Unknown Disturbances</title>
  <link>https://arxiv.org/abs/2507.17506</link>
  <pubDate>Wed, 03 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2507.17506v4 Announce Type: replace Abstract: This work presents a cognitive radar (CR) framework designed to track multiple aircraft under unknown disturbances using massive multiple-input multiple-output (MMIMO) systems. Since uniform power allocation is suboptimal across varying signal-to-noise ratios (SNRs), we couple an adaptive waveform design driven by Partially Observable Monte Carlo Planning (POMCP). By assigning an independent POMCP tree to each target, the system efficiently predicts target states. These predictions inform a constrained optimization problem that actively directs transmit energy toward weaker targets while maintaining sufficient power for stronger ones. Results confirm that the proposed POMCP method improves the detection probability for low-SNR targets from 0.6 to nearly 0.9, and yields more accurate tracking of the weakest target than a non-adaptive orthogonal waveform or a cognitive uniform-power POMCP baseline.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>MPFSR-Enhanced GNNs: Spectral Graph Neural Networks Enhancement Through Learnable Multiple-Parameter Graph Fractional Fourier Transforms</title>
  <link>https://arxiv.org/abs/2507.23570</link>
  <pubDate>Wed, 03 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2507.23570v2 Announce Type: replace Abstract: Graph neural networks (GNNs) excel in processing non-Euclidean data, but traditional spectral GNNs rely on static bases and fundamentally lack active spectral regulation. Although the graph fractional Fourier transform (GFRFT) introduces cross-domain modulation, it applies a uniform fractional parameter across all frequencies. This ignores frequency heterogeneity and restricts the models&#39; adaptive capacity in graph node classification tasks. In this paper, we propose two novel types of multiple-parameter GFRFTs (MPGFRFTs) and establish their corresponding theoretical frameworks, including essential properties, computational complexity, and parameter differentiability. By assigning independent, learnable fractional parameters to distinct frequency bands, MPGFRFTs enable fine-grained spectral regulation. Then, we operationalize this mathematical framework by designing the adaptive multiple-parameter fractional spectral regulation (MPFSR) module, a plug-and-play component for mainstream spectral models. We also establish rigorous theoretical bounds on the spectral stability of this module, guaranteeing stable and reliable convergence during end-to-end parameter optimization. Experiments demonstrate that integrating the proposed MPFSR module alleviates the constraints of static bases and yields performance gains in node classification on complex graphs, advancing a novel paradigm for active spectral modulation in graph representation learning.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Non-Identical Diffusion Models in MIMO-OFDM Channel Generation</title>
  <link>https://arxiv.org/abs/2509.01641</link>
  <pubDate>Wed, 03 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2509.01641v3 Announce Type: replace Abstract: We propose a novel diffusion model, termed the non-identical diffusion model, and investigate its application to wireless orthogonal frequency division multiplexing (OFDM) channel generation. Unlike the standard diffusion model that uses a scalar-valued time index to represent the global noise level, we extend this notion to an element-wise time indicator to capture local error variations more accurately. Non-identical diffusion enables us to characterize the reliability of each element (e.g., subcarriers in OFDM) within the noisy input, leading to improved generation results when the initialization is biased. Specifically, we focus on the recovery of wireless multi-input multi-output (MIMO) OFDM channel matrices, where the initial channel estimates exhibit highly uneven reliability across elements due to the pilot scheme. Conventional time embeddings, which assume uniform noise progression, fail to capture such variability across pilot schemes and noise levels. We introduce a matrix that matches the input size to control element-wise noise progression. Following a similar diffusion procedure to existing methods, we show the correctness and effectiveness of the proposed non-identical diffusion scheme both theoretically and numerically. For MIMO-OFDM channel generation, we propose a dimension-wise time embedding strategy. We also develop and evaluate multiple training and generation methods and compare them through numerical experiments.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>SVD-Based UGRM-GFT on Directed Product Graphs</title>
  <link>https://arxiv.org/abs/2510.10532</link>
  <pubDate>Wed, 03 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2510.10532v2 Announce Type: replace Abstract: Traditional directed graph signal processing generally depends on fixed representation matrices, whose rigid structures limit the model&#39;s ability to adapt to complex graph topologies. To address this issue, this study employed the unified graph representation matrix (UGRM) to propose a generalized graph Fourier transform (UGRM-GFT) method based on singular value decomposition (SVD) for signal analysis on directed graphs and Cartesian product graphs. We defined UGRM-GFT for general directed graphs by introducing a parameterized UGRM that incorporates traditional representations such as the Laplacian matrix and adjacency matrix. The SVD is used to construct spectral transform pairs with both left and right singular vectors. We extended this approach to two types of UGRM-GFTs applied to directed Cartesian product graphs. UGRM-GFT-I performs SVD directly on the composite UGRM matrix of the two-dimensional graph structure, suitable for globally coupled graph signals. UGRM-GFT-II separately applies SVD to the UGRMs of the two-factor graphs and then combines the results, significantly reducing computational complexity while preserving spectral expressiveness. Theoretical analysis confirmed the monotonicity of the proposed method with respect to the parameters alpha and k embedded in the UGRM. Experimental results on real-world datasets demonstrated that the proposed method significantly outperforms traditional fixed-matrix approaches in denoising tasks, with a particular emphasis on signal-to-noise ratio and bandwidth efficiency.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Covariance-Guided DFT Beam Selection for Beamspace ESPRIT in Hybrid mmWave Sensor Arrays</title>
  <link>https://arxiv.org/abs/2512.00898</link>
  <pubDate>Wed, 03 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2512.00898v2 Announce Type: replace Abstract: Accurate direction-of-arrival estimation with hybrid analog-digital millimeter-wave sensor arrays is important for localization, environment sensing, and measurement beam control for sensing applications. However, the limited number of radio-frequency chains and training beams in practical hardware makes it difficult to approach the angular resolution of fully digital arrays. This paper develops a covariance-guided discrete Fourier transform (DFT) beam selection framework tailored to beamspace ESPRIT for hybrid millimeter-wave receivers. A short hybrid training phase realizes a virtual centro-symmetric subarray and yields a sample covariance that is processed by forward-backward averaging, nonnegative least-squares power and noise fitting, and a Toeplitz positive-semidefinite projection to reconstruct a denoised full-aperture covariance matrix. This covariance is then used to score and select, within each coarse sector, small contiguous blocks of DFT beams that concentrate signal energy and preserve effective aperture under a strict beam budget. The selected beams feed a sparse beamspace ESPRIT stage that operates only on actually available adjacent beam pairs, so that the overall complexity is dominated by a single low-dimensional ESPRIT call. Monte Carlo simulations for a thirty-two-element uniform linear array with three paths indicate that, in the considered scenarios, the proposed method can reduce the gap to the Cramér-Rao bound, lower the failure rate, and provide favorable accuracy-runtime trade-offs compared with a sectorization-based baseline built from the same codebook and estimator. For the unitary DFT codebook studied here, the fine-stage beam selector reduces to a covariance-guided contiguous energy-window rule; the broader score formulation also accommodates non-unitary effective beam dictionaries arising from hardware non-idealities.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Toward Gripper-Integrated Active Electrosense for Pre-Contact Sensing in Underwater Soft Grippers</title>
  <link>https://arxiv.org/abs/2606.03204</link>
  <pubDate>Wed, 03 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.03204v1 Announce Type: cross Abstract: Underwater manipulation often occurs under degraded visibility due to turbidity, glare, and gripper occlusion, limiting the reliability of vision-based perception during approach and grasping. In such settings, soft grippers are well suited for compliant interaction, but they typically lack an onboard pre-contact cue that can guide approach and closure when vision is unreliable. This extended abstract explores active electrosense as a lightweight sensing modality that can provide a proximity-like signal prior to contact by measuring perturbations of an applied electric field in conductive media. We instrument an octopus-inspired gripper with a discrete electrode layout and record multi-channel sensing voltages using off-the-shelf hardware. Simulation and tank experiments with a suspended conductive sphere show structured, object-dependent changes in the multi-electrode voltage readout relative to empty-water baselines, with detectability varying across excitation of 5 to 20 V and frequencies from 1 mHz to 1 kHz. These findings motivate systematic investigation of gripper-integrated electrosense as a complementary pre-contact cue for underwater soft manipulation.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>RRP-Voice: A Longitudinal Dataset and Benchmark for Recurrent Respiratory Papillomatosis Detection</title>
  <link>https://arxiv.org/abs/2606.01639</link>
  <pubDate>Tue, 02 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.01639v1 Announce Type: new Abstract: Deep learning has advanced pathological voice detection rapidly, yet rare laryngeal diseases remain underexplored due to data scarcity. Recurrent Respiratory Papillomatosis (RRP) exemplifies this gap: an HPV-induced disease of the larynx in which patients oscillate between recurrence and post-surgical remission over the years. RRP demands continuous voice monitoring that existing cross-sectional corpora cannot support. We introduce the first longitudinal voice dataset for RRP, comprising recordings from 26 patients with up to ten years of follow-up. Each session pairs sustained vowels with sentence-level utterances, which are annotated by otolaryngologists and confirmed synchronously with laryngoscopy. Building on this resource, we establish a systematic benchmark spanning handcrafted features, end-to-end deep networks, self-supervised pretrained models, and recent audio large language models, all evaluated under session-level cross-validation with patient-level audit. Per-subject longitudinal analyses further confirm that the cross-sectional discriminative signal reflects laryngoscopic disease state rather than stable speaker attributes. This work lays a foundation for rare longitudinal pathological voice tasks in low-resource clinical settings.</description>
  <dc:source>Systems/eess.AS_(Audio_and_Speech_Processing)</dc:source>
</item>
<item>
  <title>Kinship Verification Using Voice</title>
  <link>https://arxiv.org/abs/2606.01704</link>
  <pubDate>Tue, 02 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.01704v1 Announce Type: new Abstract: Kinship verification (KV) from voice, the task of determining whether two speakers are biologically related, has received little attention. Our work establishes a foundational basis for this emerging frontier, contributing to both performance evaluation and detection methodologies. First, leveraging the speech recordings of the large-scale audio-visual dataset, KAN-AV, we propose a revised evaluation protocol that controls for various confounders and adopts a family-disjoint train-test split to address open-set KV. Second, we analyze the close connection between speaker verification and KV, showing that genealogical similarity of speaker pairs plays opposite roles in the two tasks. Third, we tackle KV using three neural speaker embedding extractors (ECAPA-TDNN, WavLM-ECAPA, and ReDimNet) combined with various back-ends. In zero-shot KV including same-speaker target trials, ReDimNet achieves the lowest equal error rate (EER) of 20.8%; however, performance degrades to 39.7% under strict kin trials, where same-speaker target trials are excluded. Our best trainable back-end, which applies asymmetric processing of the embedding pair to mitigate age-difference effects, obtains an EER of 32.0% (18.6% with speaker target trials included). These results highlight the difficulty of KV while showing that speaker embeddings encode familial cues, offering a promising foundation for voice-based kinship analysis.</description>
  <dc:source>Systems/eess.AS_(Audio_and_Speech_Processing)</dc:source>
</item>
<item>
  <title>Advancing Electrolaryngeal Speech Enhancement Through Speech-Text Representation Learning</title>
  <link>https://arxiv.org/abs/2606.01905</link>
  <pubDate>Tue, 02 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.01905v1 Announce Type: new Abstract: Objective: laryngectomees depend on an electromechanical device to generate electrolaryngeal (EL) speech. Compared with normal speech, EL speech suffers from severe distortion, limited phonetic variation, unnatural prosody, and temporal shifts, degrading naturalness and intelligibility. Although sequence-to-sequence (seq2seq) voice conversion (VC) based EL-speech-to-normal-speech conversion (EL2SP) is promising, substantial mismatches between EL and normal speech inevitably cause cumulative mapping errors that limit performance. To address this, we describe a novel representation learning framework integrating speech and text representations to improve mapping and reconstruction quality within a seq2seq VC model. Methods: our methodology comprises two main stages: 1) representation integration and learning, and 2) reconstruction training. A network capable of incorporating auxiliary text information is first constructed with pretrained modules to learn speech-text-based integrated representations. Then, an autoencoder-style reconstruction strategy finalizes the EL2SP model to inherit these representations without increasing model complexity. We introduce three fusion strategies (middle-, input-, and hybrid-level fusion) that progressively enhance learning. Moreover, besides standard seq2seq VC objectives, an additional reconstruction loss on the integrated representation is introduced to refine representation transfer. Results: experiments under different EL2SP datasets consistently demonstrate that our methods, combined with data augmentations, outperform baselines relying solely on speech representations. Furthermore, progressive improvements with system design depth validate the effectiveness of our methods. Significance: the proposed methods provide an extensible and practical methodology for EL speech enhancement and assistive communication technologies.</description>
  <dc:source>Systems/eess.AS_(Audio_and_Speech_Processing)</dc:source>
</item>
<item>
  <title>Localizing broadband noise sources using the Loève spectrum and a 2.5D approach</title>
  <link>https://arxiv.org/abs/2606.02127</link>
  <pubDate>Tue, 02 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.02127v1 Announce Type: new Abstract: The localization of moving sound sources using a microphone array is typically based on modifying the signal to compensate for the Doppler effect. In the time domain this compensation is done on a sample-by-sample basis. In the frequency domain short time segments need to be used in which the Doppler effect is assumed to be approximately constant and a discrete Fourier transform is done on each segment. In contrast, the authors developed an inverse 2.5D localization method for uniformly moving single-frequency sources that works in the spectral domain and allows for the use of longer windows. This was achieved by modifying the 2.5D forward model to directly compute the effect of the motion in the static observer position. The method requires neither modification of the measured signal nor quasi-stationarity of the measurements within the window used. Unfortunately, this approach is not directly suitable for broad-band stochastic sources, and in the present work we will investigate how the statistical properties of a uniformly moving stochastic source change when observed at a static observer. Using a 2.5D setting, the relation between the power spectral density of the moving source and the Loève spectrum, which is a generalization of the cross-spectral density at the static receivers, was derived. Based on simulated data with speeds up to 100 m/s, the work presented here provides a proof of concept for a method based on multi-taper estimates for the Loève spectrum to localize moving broad-band stochastic sources. Currently, the method requires a stationary source signal and that the spectral density is flat within a certain range around the frequency of interest. Also, correlations between sources are currently not considered.</description>
  <dc:source>Systems/eess.AS_(Audio_and_Speech_Processing)</dc:source>
</item>
<item>
  <title>Domain-Agnostic Incremental Learning for Sound Classification. A DCASE 2026 Challenge task</title>
  <link>https://arxiv.org/abs/2606.02173</link>
  <pubDate>Tue, 02 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.02173v1 Announce Type: new Abstract: This paper presents the Domain-Agnostic Incremental Learning for Audio Classification Task of the DCASE 2026 Challenge. Incremental learning refers to sequentially learning new tasks with the same system while maintaining its knowledge and performance on the previously learned task. Domain-incremental learning for sound classification refers to learning the same sound classes but in different acoustic domains, and was formalized as a data challenge for the first time in DCASE 2026. Participants will train a system to learn ten sound classes in three different domains, with learning at each incremental task not having access to previous task data. Submitted systems will be ranked by the overall average accuracy calculated over the three domains. The provided baseline system obtains a modest performance of 44.9% accuracy over the three domains, mostly due to erroneous inference of the domain for the test sample.</description>
  <dc:source>Systems/eess.AS_(Audio_and_Speech_Processing)</dc:source>
</item>
<item>
  <title>Breaking the Pair: Evaluating Dyadic Interaction via Speaker Switching</title>
  <link>https://arxiv.org/abs/2606.02185</link>
  <pubDate>Tue, 02 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.02185v1 Announce Type: new Abstract: Speakers in dialogue continuously adapt their communicative behavior across acoustic, lexical, and semantic dimensions, a phenomenon known as conversational entrainment. Modeling this process requires representations that capture the global structure of interaction, yet prior approaches fail to disentangle dyad-specific patterns from speaker-specific traits, limiting their ability to capture true conversational adaptation. We address this with the Dyadic Distance Matrix (DDM), which encodes all pairwise similarities between the turns of two speakers over an entire conversation, capturing long-range cross-speaker dependencies. This raises a key question: does the DDM represent genuine interaction, or merely reflect individual speaker characteristics? We propose the speaker-switch test, a principled control in which one speaker&#39;s turns are replaced with those from an unrelated speaker drawn from a different conversation. This preserves turn-level statistics while disrupting the original dyadic coadaptation. The ability to distinguish real from switched DDMs thus directly evaluates whether the representation encodes interaction-specific structure. Across four embedding types and classifiers including ResNet-50 on the CANDOR corpus, real DDMs are consistently distinguishable from their switched counterparts. Comparisons with LibriSpeech show higher discriminability in read speech, highlighting the role of prosodic variability in naturalistic conversations. GradCAM analysis further reveals distinct structural signatures driving classification. These results establish the speaker-switch test as a robust diagnostic for validating representations of dyadic conversational interaction.</description>
  <dc:source>Systems/eess.AS_(Audio_and_Speech_Processing)</dc:source>
</item>
<item>
  <title>SiamCTC: Learning Speech Representations through Monotonic Temporal Alignment</title>
  <link>https://arxiv.org/abs/2606.02220</link>
  <pubDate>Tue, 02 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.02220v1 Announce Type: new Abstract: Self-supervised speech representation learning has made significant progress through Siamese networks, which leverage different views of the same input. However, existing methods often require frame-wise alignment between these views, overlooking the broader linguistic context invariance across different speaking styles. We introduce SiamCTC, a framework that integrates Siamese networks with Connectionist Temporal Classification (CTC) to learn speech representations without strict frame-level correspondence. By employing CTC loss to establish flexible, monotonic alignments between differing temporal realizations of the same content, SiamCTC accommodates speed perturbations and other temporal augmentations. This design relaxes frame-wise constraints while preserving temporal coherence and enhancing robustness to speaking-rate variations in downstream tasks. Our experiments demonstrate that SiamCTC leads to more adaptable speech representations, particularly at diverse speaking rates.</description>
  <dc:source>Systems/eess.AS_(Audio_and_Speech_Processing)</dc:source>
</item>
<item>
  <title>Exploiting Noise Inseparability for Weakly-Supervised Discriminative Speech Denoising Using Noisy Targets</title>
  <link>https://arxiv.org/abs/2606.02327</link>
  <pubDate>Tue, 02 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.02327v1 Announce Type: new Abstract: Speech denoising is an often necessary step not only for human listening, but also for downstream processing by systems lacking robustness to noisy, real-world acoustic conditions. Unfortunately, denoising is a problem where conventional in-domain supervised training is not trivial, as the training targets cannot be annotated by humans: producing a clean version of a naturally-noisy speech recording is itself the task to solve. Supervised training is typically performed through the artificial addition of noise to clean speech recordings, which can only be sourced from controlled domains, a significant limitation due to the poor out-of-domain generalization of neural networks. An alternative is noisy target training (NyTT), which simply replaces the clean speech with in-domain noisy recordings, with the hope that learning to remove the artificial noise will extend to the natural. Though having shown promising results, NyTT&#39;s training objective is not minimized by clean speech estimates. We show that by estimating the artificial noise in addition to the naturally-noisy speech, the undesirable optimum can actually be exploited: the residual noise in the speech estimate can be canceled by the noise estimate via simple subtraction. Crucially, the optimum is fully compatible with conventional artificial mixtures, enabling joint training using both types of data with consistent optimization targets, opening the door to improved domain adaptability. The effectiveness of our approach is demonstrated through WHAM! and CHiME-3-based benchmarks.</description>
  <dc:source>Systems/eess.AS_(Audio_and_Speech_Processing)</dc:source>
</item>
<item>
  <title>SoulX-Transcriber: A Robust End-to-End Framework for Multi-Speaker Speech Transcription</title>
  <link>https://arxiv.org/abs/2606.02400</link>
  <pubDate>Tue, 02 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.02400v1 Announce Type: new Abstract: Recent advances in Automatic Speech Recognition (ASR) and Large Language Models (LLMs) have significantly improved speech understanding capabilities. However, multi-speaker speech transcription remains a challenging task, constrained by highly similar speaker voices, rapid turn-taking transitions, overlapping utterances, and inaccurate speaker boundary segmentation. These challenges become particularly pronounced in real-world conversational audio, where speaker dynamics and acoustic conditions are highly variable. This technical report presents SoulX-Transcriber, a unified multi-speaker transcription system that jointly models speaker diarization (SD) and ASR within an LLM-based framework. SoulX-Transcriber adopts a two-stage training strategy to improve both speaker discrimination and transcription robustness. In the first stage, speaker-aware multi-task continuous pre-training enhances speaker representation learning and boundary perception. In the second stage, supervised fine-tuning further optimizes the model for accurate end-to-end speaker-attributed transcription under complex multi-speaker conditions. SoulX-Transcriber delivers strong performance and robustness across multiple public benchmarks, including AliMeeting, AISHELL-4, and AMI, while maintaining high adaptability to multi-domain scenarios.</description>
  <dc:source>Systems/eess.AS_(Audio_and_Speech_Processing)</dc:source>
</item>
<item>
  <title>SALSA: Speech Aware LLM Adaptation via Learned Steering Activation Vectors</title>
  <link>https://arxiv.org/abs/2606.00460</link>
  <pubDate>Tue, 02 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.00460v1 Announce Type: cross Abstract: Speech-aware large language models often generalize poorly to out-of-domain settings. We propose SALSA (Speech-Aware LLM Adaptation via Learned Steering Activations), a lightweight adaptation method that learns layer-wise steering vectors. Unlike commonly used steering approaches that rely on contrastive activation differences, SALSA directly optimizes steering vectors using a supervised objective. Across children&#39;s speech, multilingual speech, and Mandarin-English code-switching benchmarks, SALSA substantially improves performance over zero-shot inference and speech in-context learning baselines, achieving up to 46.8% relative improvements over zero-shot. Analysis further demonstrates that steering the encoder, particularly the later layers, is more effective than steering the LLM backbone. These findings suggest that steering improves downstream ASR performance by adapting higher-level acoustic and phonetic representations to better align with the pretrained language model representation space, rather than by modifying the decoder itself.</description>
  <dc:source>Systems/eess.AS_(Audio_and_Speech_Processing)</dc:source>
</item>
<item>
  <title>PolySpeech-100: A Large-Scale Benchmark for Speech Understanding Across 100+ Languages and Dialects</title>
  <link>https://arxiv.org/abs/2606.01016</link>
  <pubDate>Tue, 02 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.01016v1 Announce Type: cross Abstract: While End-to-End (E2E) Speech-Large Language Models (Speech-LLMs) are rapidly evolving, their evaluation methodologies remain limited to the era of simple transcription. Existing benchmarks suffer from three critical limitations: a pronounced bias towards high-resource languages, a focus on low-level recognition (ASR) rather than semantic reasoning, and a neglect of regional dialects. To bridge this gap, we introduce PolySpeech-100, a massive-scale benchmark designed to assess &#39;native-level&#39; speech comprehension across 110 linguistic variants. We employ a novel hybrid construction pipeline that augments gold-standard human recordings with instruction-driven synthetic speech, allowing us to cover 19 distinct Chinese dialects and over 80 low-resource languages. Extensive evaluation of 22 state-of-the-art models (including Gemini-3, GPT-Audio, and Qwen2.5-Omni) yields pivotal insights. First, we demonstrate that open-source E2E models outperform Cascade (ASR+LLM) systems on heavy dialects, proving that direct audio processing preserves critical paralinguistic cues and prosodic features (e.g., intonation, stress) that are often lost in standard transcription. Second, we reveal a significant performance gap: while commercial models maintain robustness, open-source models suffer catastrophic degradation on low-resource languages. Finally, counter-intuitively, we observe that under standard zero-shot settings, Chain-of-Thought prompting frequently degrades speech understanding performance for most evaluated models, revealing a potential modality alignment gap in current architectures. PolySpeech-100 establishes a rigorous standard for the next generation of inclusive, omni-capable Speech-LLMs. The data, demo, and code are publicly available at https://github.com/YoungSeng/PolySpeech-100.</description>
  <dc:source>Systems/eess.AS_(Audio_and_Speech_Processing)</dc:source>
</item>
<item>
  <title>MURMUR: An Efficient Inference System for Long-Form ASR</title>
  <link>https://arxiv.org/abs/2606.01483</link>
  <pubDate>Tue, 02 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.01483v1 Announce Type: cross Abstract: Long-form automatic speech recognition (ASR) requires both high accuracy and low latency, but existing systems force a trade-off between the two. Chunk-based pipelines process audio in parallel windows for low latency, but lose cross-chunk context and need brittle heuristics to align speakers and timestamps at boundaries. Long-context ASR models resolve everything in a single pass for better accuracy, but are an order of magnitude slower. We propose Murmur, an inference system that overcomes this trade-off by operating at two levels. At the inter-chunk level, we revisit the chunk-based pipeline for modern long-context ASR, treating chunk size as a tunable hyperparameter, and show that intermediate chunk sizes strike a good balance of accuracy and latency. At the intra-chunk level, we exploit attention sparsity through a sliding window KV cache eviction policy applied to both output and speech tokens. On AMI-IHM, Murmur matches single-pass accuracy while reducing latency by 4.2x, with further gains from token eviction at less than 1% relative tcpWER degradation. The code of Murmur is available at https://github.com/uw-syfi/Murmur.</description>
  <dc:source>Systems/eess.AS_(Audio_and_Speech_Processing)</dc:source>
</item>
<item>
  <title>DiffAU: Diffusion-Based Ambisonics Upscaling</title>
  <link>https://arxiv.org/abs/2510.00180</link>
  <pubDate>Tue, 02 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2510.00180v3 Announce Type: replace Abstract: Spatial audio enhances immersion by reproducing 3D sound fields, with Ambisonics offering a scalable format for this purpose. While first-order Ambisonics (FOA) notably facilitates hardware-efficient acquisition and storage of sound fields as compared to high-order Ambisonics (HOA), its low spatial resolution limits realism, highlighting the need for Ambisonics upscaling (AU) as an approach for increasing the order of Ambisonics signals. In this work, we propose DiffAU, a cascaded AU method that leverages recent developments in diffusion models combined with novel adaptation to spatial audio to generate 3rd-order Ambisonics from FOA. By learning data distributions, DiffAU provides a principled approach that rapidly and reliably reproduces HOA in various settings. Experiments in anechoic conditions with multiple speakers show strong objective and perceptual performance.</description>
  <dc:source>Systems/eess.AS_(Audio_and_Speech_Processing)</dc:source>
</item>
<item>
  <title>Systematic Evaluation of Time-Frequency Features for Binaural Sound Source Localization</title>
  <link>https://arxiv.org/abs/2511.13487</link>
  <pubDate>Tue, 02 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2511.13487v3 Announce Type: replace Abstract: This study presents a systematic evaluation of time-frequency feature design for binaural sound source localization (SSL), focusing on how feature selection influences model performance across diverse conditions. We investigate the performance of a convolutional neural network (CNN) model using various combinations of amplitude-based features (magnitude spectrogram, interaural level difference - ILD) and phase-based features (phase spectrogram, interaural phase difference - IPD). Evaluations on in-domain and out-of-domain data with mismatched head-related transfer functions (HRTFs) reveal that carefully chosen feature combinations often outperform increases in model complexity. While two-feature sets such as ILD + IPD are sufficient for in-domain SSL, generalization to diverse content requires richer inputs combining channel spectrograms with both ILD and IPD. Using the optimal feature sets, our low-complexity CNN model achieves competitive performance. Our findings underscore the importance of feature design in binaural SSL and provide practical guidance for both domain-specific and general-purpose localization.</description>
  <dc:source>Systems/eess.AS_(Audio_and_Speech_Processing)</dc:source>
</item>
<item>
  <title>FastSLM: Hierarchical Temporal Abstraction for Efficient Long-Form Speech Adaptation</title>
  <link>https://arxiv.org/abs/2601.06199</link>
  <pubDate>Tue, 02 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2601.06199v3 Announce Type: replace Abstract: Scaling Multimodal Large Language Models (MLLMs) to long-form speech is bottlenecked by the explosive growth of input tokens. Unlike images or videos, audio lacks overlapping information, making extreme 1-token compression highly susceptible to the loss of fine-grained acoustic cues. To overcome this, we propose FastSLM, a token-efficient architecture featuring the Hierarchical Temporal Abstractor (HTA). HTA progressively distills non-overlapping acoustic features across multiple temporal scales, achieving an extreme compression rate of 1.67 tokens per second (a 97% reduction) without losing critical context. Experimental results show that FastSLM achieves competitive performance with state-of-the-art models on long-form benchmarks despite operating with significantly fewer FLOPs and parameters. The source code and model checkpoints are available at https://anonymous.4open.science/r/FastSLM-8BD3.</description>
  <dc:source>Systems/eess.AS_(Audio_and_Speech_Processing)</dc:source>
</item>
<item>
  <title>Description and Discussion on DCASE 2026 Challenge Task 4: Spatial Semantic Segmentation of Sound Scenes</title>
  <link>https://arxiv.org/abs/2604.00776</link>
  <pubDate>Tue, 02 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2604.00776v2 Announce Type: replace Abstract: This paper presents an overview of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2026 Challenge Task 4, Spatial Semantic Segmentation of Sound Scenes (S5). The S5 task focuses on the joint detection and separation of sound events in complex spatial audio mixtures, contributing to the foundation of immersive communication. First introduced in DCASE 2025, the S5 task continues in DCASE 2026 Task 4 with key changes to better reflect real-world conditions, including allowing mixtures to contain multiple sources of the same class and to contain no target sources. In this paper, we describe the task setting, along with the corresponding updates to the evaluation metrics and dataset. The experimental results of the submitted systems are also reported and analyzed. The official access point for data and code is https://github.com/nttcslab/dcase2026_task4_baseline.</description>
  <dc:source>Systems/eess.AS_(Audio_and_Speech_Processing)</dc:source>
</item>
<item>
  <title>Step-Audio-R1.5 Technical Report</title>
  <link>https://arxiv.org/abs/2604.25719</link>
  <pubDate>Tue, 02 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2604.25719v2 Announce Type: replace Abstract: Recent advancements in large audio language models have extended Chain-of-Thought (CoT) reasoning into the auditory domain, enabling models to tackle increasingly complex acoustic and spoken tasks. To elicit and sustain these extended reasoning chains, the prevailing paradigm -- driven by the success of text-based reasoning models -- overwhelmingly relies on Reinforcement Learning with Verified Rewards (RLVR). However, as models are strictly optimized to distill rich, continuous auditory contexts into isolated, verifiable text labels, a fundamental question arises: are we fostering true audio intelligence, or merely reducing a continuous sensory medium into a discrete puzzle? We identify this as the &quot;verifiable reward trap.&quot; While RLVR yields remarkable scores on standardized objective benchmarks, it systematically degrades the real-world conversational feel of audio models. By prioritizing isolated correctness over acoustic nuance, RLVR reduces dynamic interactions to mechanical &quot;answering machines,&quot; severely compromising prosodic naturalness, emotional continuity, and user immersion, particularly in long-turn dialogues. To bridge the gap between mechanical objective verification and genuine sensory empathy, we introduce Step-Audio-R1.5, marking a paradigm shift toward Reinforcement Learning from Human Feedback (RLHF) in audio reasoning. Comprehensive evaluations demonstrate that Step-Audio-R1.5 not only maintains robust analytical reasoning but profoundly transforms the interactive experience, redefining the boundaries of deeply immersive long-turn spoken dialogue.</description>
  <dc:source>Systems/eess.AS_(Audio_and_Speech_Processing)</dc:source>
</item>
<item>
  <title>Embedding-Space Diffusion for Zero-Shot Environmental Sound Classification</title>
  <link>https://arxiv.org/abs/2412.03771</link>
  <pubDate>Tue, 02 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2412.03771v3 Announce Type: replace-cross Abstract: Zero-shot learning enables models to generalise to unseen classes by leveraging semantic information, bridging the gap between training and testing sets with non-overlapping classes. While much research has focused on zero-shot learning in computer vision, the application of these methods to environmental audio remains underexplored, with poor performance in existing studies. Generative methods, which have demonstrated success in computer vision, are notably absent from zero-shot environmental sound classification studies. To address this gap, this work investigates generative methods for zero-shot learning in environmental audio. Two successful generative models from computer vision are adapted: a cross-aligned and distribution-aligned variational autoencoder (CADA-VAE) and a leveraging invariant side generative adversarial network (LisGAN). Additionally, we introduced a novel diffusion model conditioned on class auxiliary data. Synthetic embeddings generated by the diffusion model are combined with seen class embeddings to train a classifier. Experiments are conducted on five environmental audio datasets, ESC-50, ARCA23K-FSD, FSC22, UrbanSound8k and TAU Urban Acoustics 2019, and one music classification dataset, GTZAN. Results show that the diffusion model outperforms all baseline methods on average across six audio datasets. This work establishes the diffusion model as a promising approach for zero-shot learning and introduces the first benchmark of generative methods for zero-shot environmental sound classification, providing a foundation for future research.</description>
  <dc:source>Systems/eess.AS_(Audio_and_Speech_Processing)</dc:source>
</item>
<item>
  <title>MAVL: A Multilingual Audio-Video Lyrics Dataset for Animated Song Translation</title>
  <link>https://arxiv.org/abs/2505.18614</link>
  <pubDate>Tue, 02 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2505.18614v5 Announce Type: replace-cross Abstract: Lyrics translation requires both accurate semantic transfer and preservation of musical rhythm, syllabic structure, and poetic style. In animated musicals, the challenge intensifies due to alignment with visual and auditory cues. We introduce Multilingual Audio-Video Lyrics Benchmark for Animated Song Translation (MAVL), the first multilingual, multimodal benchmark for singable lyrics translation. By integrating text, audio, and video, MAVL enables richer and more expressive translations than text-only approaches. Building on this, we propose Syllable-Constrained Audio-Video LLM with Chain-of-Thought (SylAVL-CoT), which leverages audio-video cues and enforces syllabic constraints to produce natural-sounding lyrics. Experimental results demonstrate that SylAVL-CoT significantly outperforms text-based models in singability and contextual accuracy, emphasizing the value of multimodal, multilingual approaches for lyrics translation.</description>
  <dc:source>Systems/eess.AS_(Audio_and_Speech_Processing)</dc:source>
</item>
<item>
  <title>HRTFformer: A Spatially-Aware Transformer for Individual HRTF Upsampling in Immersive Audio Rendering</title>
  <link>https://arxiv.org/abs/2510.01891</link>
  <pubDate>Tue, 02 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2510.01891v2 Announce Type: replace-cross Abstract: Individual Head-Related Transfer Functions (HRTFs) are starting to be introduced in many commercial immersive audio applications and are crucial for realistic spatial audio rendering. However, one of the main hesitations regarding their introduction is that creating individual HRTFs is impractical at scale due to the complexities of the HRTF measurement process. To mitigate this drawback, HRTF spatial upsampling has been proposed with the aim of reducing the measurements required. While prior work has seen success with different machine learning (ML) approaches, these models often struggle with long-range preservation of local spatial variation patterns across neighbouring source directions and generalization at high upsampling factors. In this paper, we propose a novel transformer-based architecture for HRTF upsampling, leveraging the attention mechanism to better capture spatial correlations across the HRTF sphere. Working in the spherical harmonic (SH) domain, our model learns to reconstruct high-resolution HRTFs from sparse input measurements with significantly improved accuracy. To enhance spatial coherence, we introduce a neighbour dissimilarity loss that promotes magnitude smoothness, yielding more realistic upsampling. We evaluate our method using both perceptual localization models and objective spectral distortion metrics. Experiments show that our model outperforms existing methods across several evaluation metrics in generating realistic, high-fidelity HRTFs.</description>
  <dc:source>Systems/eess.AS_(Audio_and_Speech_Processing)</dc:source>
</item>
<item>
  <title>Mathematical framework for perception-driven parameter choice in image denoising</title>
  <link>https://arxiv.org/abs/2606.00122</link>
  <pubDate>Tue, 02 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.00122v1 Announce Type: new Abstract: We approach image denoising from a perception-driven perspective: how can we select the parameters that are best suited for human visual perception? We combine research methods in mathematics and psychology to develop a mathematical framework for measuring perceived similarity. We construct a sample set of differently denoised photographs by using the same base image as input data and by tuning the parameter value in a total variation denoising algorithm. A comparison test is conducted with human participants to survey perceived differences between the images. Analyzing the results with psychometric scaling provides us with a HaarPSI value to use as a threshold in discretizing parameter grids. As a result, we obtain psychometrically scaled, openly available image sets that are ready to use in further experiments in perception-driven imaging, as well as a framework for ensuing experiments involving comparison tests.</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>Bounding Global and Local Compression Error of Signal Parameterizations</title>
  <link>https://arxiv.org/abs/2606.00126</link>
  <pubDate>Tue, 02 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.00126v1 Announce Type: new Abstract: Differentiable signal parameterizations such as implicit neural representations (INRs) and hybrid models are increasingly central to computational imaging, yet principled tools for evaluating reconstruction fidelity at finite model size remain limited when ground truth is unavailable. We introduce a framework for predicting the reconstruction error of compressive signal parameterizations, yielding non-asymptotic, signal-specific bounds that are both theoretically sound and efficiently computable without access to the ground truth signal. Specifically, we prove that when parameterization-based compression satisfies certain natural properties, the compression error at any compression level is bounded by a simple scaled difference between model predictions at different compression levels. We verify these properties for representative model families including interpolated grids, Fourier feature networks, multi-resolution hash encodings, and tensor factorizations, and show empirically that the resulting worst-case guarantees can be efficiently adapted into signal-specific error predictors that are tight and generalizable. Across direct fitting of synthetic and natural signals, and inverse problems including radiance field and MRI reconstruction, our method closely tracks global error curves and yields informative local error heatmaps without ground-truth access. Code is available at https://github.com/voilalab/global_error_bound.</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>Multi-Contrast MRI Motion Correction via Parameter-Informed Disentanglement and Adaptive Experts</title>
  <link>https://arxiv.org/abs/2606.00146</link>
  <pubDate>Tue, 02 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.00146v1 Announce Type: new Abstract: Motion artifacts in magnetic resonance imaging (MRI) degrade diagnostic reliability. Existing deep learning methods are typically contrast-specific and fail to generalize across diverse modalities and artifact severities. We propose a unified framework combining parameter-informed contrast disentanglement with severity-aware adaptive correction. ScanCLIP, pretrained on over 30,000 MRI text-image pairs, derives contrast embeddings from acquisition parameters to disentangle contrast style from anatomical content, yielding contrast-free features. A Vision Transformer then estimates motion severity and routes features through a Mixture-of-Experts network, enabling targeted artifact correction. A dual-pathway decoder reconstructs both the clean image and residual artifact map, enforcing image-space consistency. On IXI and HCP benchmarks, our method improves PSNR by 0.75 dB and SSIM by up to 0.0279 over state-of-the-art approaches, with larger gains at higher artifact severities. It further demonstrates robust zero-shot generalization on real-world clinical data acquired with unseen scanning parameters, where existing methods either fail to remove artifacts or introduce additional distortions.</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>A physics-informed foundation model for quantitative diffusion MRI</title>
  <link>https://arxiv.org/abs/2606.00156</link>
  <pubDate>Tue, 02 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.00156v1 Announce Type: new Abstract: Understanding the human brain requires access to its microscopic tissue architecture. Diffusion magnetic resonance imaging (MRI) provides the only noninvasive window into whole-brain microstructure in vivo, yet reliable quantitative mapping remains confined to specialized research settings requiring dense sampling and optimized acquisition protocols. To address this gap, we present a physics-informed generative microstructure network (PIGMENT) that learns a universal generative prior of human brain microstructure and adapts it zero-shot to each participant&#39;s measured data to recover subject-specific maps. Trained on 11375 scans spanning multiple sites, vendors, and field strengths, PIGMENT enabled reliable quantitative mapping for tensor, kurtosis, and NODDI models across external datasets from five independent centers. It remains effective where conventional fitting becomes unreliable, recovering meaningful maps from extremely sparse acquisitions while supporting downstream tractography and structural connectivity mapping. PIGMENT estimates demonstrated strong biological validity, preserving submillimeter cortical microarchitectural patterns and early-childhood white matter developmental trajectories from 10-fold accelerated scans. Furthermore, PIGMENT enables reliable quantitative tensor mapping on cost-efficient low-field systems and the extraction of tumor-related biomarkers using ultra-fast clinical protocols. Together, these results establish PIGMENT as a physics-informed foundation model that extends quantitative diffusion MRI into regimes traditionally too sparse, heterogeneous, or clinically constrained for reliable analysis.</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>Training-Free Continuous Bitrate Control for Scalable Image Coding for Humans and Machines</title>
  <link>https://arxiv.org/abs/2606.00158</link>
  <pubDate>Tue, 02 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.00158v1 Announce Type: new Abstract: Continuous variable-rate compression is highly demanded in real-world applications, but remains underexplored in scalable image coding for humans and machines. In this paper, we propose a training-free variable-rate scalable image coding framework. By adjusting quantization steps based on predicted scale values, the proposed method achieves continuous bitrate control while preserving high-scale information in the machine and enhancement layers. Experimental results demonstrate the effectiveness of the proposed method and highlight the importance of bitrate allocation between the two layers.</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>AutoIQ: An Ensemble Framework for Automatic Assessment of Geometric Distortion in Prostate Diffusion-Weighted Imaging</title>
  <link>https://arxiv.org/abs/2606.00393</link>
  <pubDate>Tue, 02 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.00393v1 Announce Type: new Abstract: Geometric distortion in prostate diffusion-weighted imaging (DWI) can impair lesion localization and reduce the reliability of MRI-based clinical assessment. We propose AutoIQ, an ensemble machine learning framework for automatic quantification and classification of DWI geometric distortion severity. A total of 140 retrospective prostate biparametric MRI examinations were analyzed, including 33 scans with severe distortion requiring repeat acquisition and 107 scans with acceptable distortion based on expert radiologist assessment. AutoIQ combines two complementary distortion quantification strategies: a segmentation-based method measuring prostate boundary mismatch between T2-weighted imaging (T2WI) and DWI, and a registration-based method estimating deformation magnitude after DWI-to-T2WI alignment. The resulting distortion scores were used to train individual classifiers and a logistic-regression ensemble model. Both computational methods significantly differentiated severe from acceptable distortion cases (p &lt; 0.001). On an independent test set, the ensemble model achieved an accuracy of 0.95, F1-score of 0.93, and AUC of 0.98, outperforming individual models. These results suggest that AutoIQ can provide automated, quantitative quality assessment for prostate DWI and may help identify scans that require repeat acquisition.</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>RFDT-Channel: RGB-LiDAR-Based RF Digital Twin Scene Construction for 28 GHz Indoor Ray-Tracing Channel Simulation</title>
  <link>https://arxiv.org/abs/2606.01261</link>
  <pubDate>Tue, 02 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.01261v1 Announce Type: new Abstract: Real-scene indoor millimeter-wave simulation requires efficient modeling of radio frequency (RF)-computable geometry and electromagnetic material properties. To address the low efficiency of manual scene modeling, the limited RF adaptability of visually reconstructed meshes, and the lack of material binding in 28 GHz ray-tracing simulation, RFDT-Channel is developed as an RF digital twin scene construction workflow based on red-green-blue (RGB) images and light detection and ranging (LiDAR) point clouds. Indoor videos and point clouds are collected by a Jetson Orin platform with LiDAR and GMSL cameras. An initial triangular mesh is generated through COLMAP, 3D Gaussian Splatting, and SuGaR. The LiDAR point cloud then provides geometric and scale references for RF-oriented regularization in Blender, including alignment, wall solidification, door/window opening construction, and topology repair. OpenScene semantic segmentation maps major indoor structures to concrete, glass, wood, and metal materials, and Sionna RT performs 28 GHz ray tracing. Under a fixed transmitter-receiver deployment, the generated channel impulse response (CIR), channel frequency response (CFR), and Radio Map results show that material binding mainly changes weak reflection, transmission, and scattering paths, reducing the number of effective paths from about 742 to about 52 while keeping the dominant path amplitude nearly unchanged.</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>ResNet-34 with Lightweight Decoder for Accurate and Efficient Segmentation of Fetal Brain MRI</title>
  <link>https://arxiv.org/abs/2606.01293</link>
  <pubDate>Tue, 02 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.01293v1 Announce Type: new Abstract: Accurate segmentation of fetal brain tissues in Magnetic Resonance Imaging (MRI) is critical for early diagnosis of congenital abnormalities and improving prenatal care. However, the task remains difficult because of fetal motion, low tissue contrast, and major anatomical variability throughout gestational ages, particularly in segmenting complex structures such as white matter, gray matter, lateral ventricles, deep gray matter, extra-cerebrospinal fluid, cerebellum, and brainstem. As a solution to these difficulties, this research introduces a novel deep learning model that combines a ResNet-34 encoder with a lightweight decoder leveraging multi-layer perceptron (MLP) modules for adaptive feature refinement. This design specifically enhances the model&#39;s ability to preserve anatomical boundaries and mitigate segmentation errors caused by motion artifacts and intensity inhomogeneities. Computational efficiency is achieved by reducing parameter count, employing bilinear upsampling instead of transposed convolutions, and optimizing the decoder for speed without sacrificing accuracy. Trained and validated on the FeTA 2021 dataset using 5-fold cross-validation, the proposed model outperforms baseline architectures such as UNet, UNet++, DeepLabV3, and DeepLabV3+, achieving an average Accuracy of 97.37% with a mean Dice Similarity Coefficient (DSC) of 90.33%, mean Intersection over Union (IoU) of 86.93%, and Precision of 90.83%. Additionally, its fast inference time and reduced computational load make it well-suited for integration into real-time clinical workflows.</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>PINNOCHIO: Physics-Informed Neural Network for Coupled Hyperelastic Interface-Volume Simulation in Orthognathic Surgery</title>
  <link>https://arxiv.org/abs/2606.01572</link>
  <pubDate>Tue, 02 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.01572v1 Announce Type: new Abstract: Predicting patient-specific facial soft-tissue deformation is critical for iterative orthognathic surgery planning. However, current computational methods face a strict accuracy-efficiency trade-off: high-fidelity Finite Element Methods (FEM) are computationally prohibitive, whereas pure deep learning models often produce biomechanically inconsistent results. While Physics-Informed Neural Networks (PINNs) offer a promising avenue, learning the complex heterogeneous mechanics of bone--soft-tissue interactions with only partial clinical supervision (i.e., outer facial surfaces) remains highly unstable. To overcome these challenges, we present PINNOCHIO, a novel physics-informed framework for facial soft-tissue simulation. PINNOCHIO introduces a hybrid sequential decomposition that explicitly decouples discontinuous bone--soft-tissue interface movements from continuous volumetric hyperelastic deformation. This structural separation enables stable training and facilitates a physics-enabled sim-to-real adaptation strategy, ensuring internal biomechanical consistency without requiring volumetric ground truth. Evaluated on a 40-patient clinical cohort, PINNOCHIO outperforms existing baselines in both surface accuracy and physical validity. Furthermore, it achieves a substantial speedup over FEM, successfully resolving the accuracy-efficiency trade-off to provide a highly reliable and practical tool for interactive surgical planning.</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>Regularized joint reconstruction and slab combination for accelerated three-dimensional multi-slab diffusion-weighted imaging using multi-scale energy models</title>
  <link>https://arxiv.org/abs/2606.01606</link>
  <pubDate>Tue, 02 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.01606v1 Announce Type: new Abstract: This work presents Energy-based Profile Encoding, EPEN, a joint reconstruction framework for high-resolution diffusion-weighted MRI from undersampled 3D multi-slab k-space acquisitions, designed to suppress slab-boundary artifacts while preserving fine anatomical detail. EPEN formulates the multi-slab acquisition process using a bilinear forward model in which both the diffusion-weighted image volume and slab excitation profiles are treated as unknown variables. Reconstruction is posed as a maximum a posteriori optimization problem with three components: a Gaussian data-fidelity term enforcing consistency with the acquired k-space measurements, a CNN-based deep energy prior that represents the negative log distribution of clean diffusion-weighted images, and a quadratic regularization term that constrains the estimated slab profiles toward an initial profile estimate. The gradient of the learned energy prior guides accelerated reconstruction toward an artifact-free image distribution. The resulting nonconvex objective is solved using alternating minimization, with image-volume updates performed through a majorize-minimize scheme using conjugate-gradient optimization and slab-profile updates estimated by regularized least squares. Across multiple acceleration factors and slab configurations, EPEN substantially reduced slab-boundary artifacts compared with conventional slab-boundary correction methods, while improving structural consistency and preserving diffusion-weighted contrast. These results demonstrate that EPEN enables robust joint 3D multi-slab diffusion MRI reconstruction and slab-profile correction within a unified optimization framework supported by deep energy-based image priors.</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>MoRE: A Mixture-of-Experts-Based Task-Adaptive End-to-End Network for Multimodal MRI Reconstruction</title>
  <link>https://arxiv.org/abs/2606.01784</link>
  <pubDate>Tue, 02 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.01784v1 Announce Type: new Abstract: Although accelerated MRI reconstruction has advanced rapidly through end-to-end learning, deploying a single unified network that generalizes across diverse anatomies and contrasts under constrained computational resources remains challenging. In this paper, we introduce MoRE, a sparsely activated mixture-of-experts (MoE) module integrated into an end-to-end variational network. MoRE couples a shared encoder with sample-wise, unsupervised routing to activate a minimal subset of expert decoders while strictly preserving physics-based data consistency. Evaluated on the fastMRI multi-coil brain and knee datasets under 8x undersampling, MoRE achieves highly stable SSIM and PSNR performance across multi-contrast datasets. Furthermore, t-SNE visualization of the routing embeddings reveals interpretable, modality-aware expert specialization. The sparse conditional computation mechanism ensures that the architectural overhead remains modest. These results demonstrate that MoE-style capacity scaling can significantly enhance general-purpose MRI reconstruction without requiring proportional increases in computational power.</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>Face Liveness Detection Using RGB and Thermal Image Fusion</title>
  <link>https://arxiv.org/abs/2606.01836</link>
  <pubDate>Tue, 02 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.01836v1 Announce Type: new Abstract: Face detection with visible-spectrum cameras can capture facial features, but it often fails to distinguish live subjects from spoof sources such as photographs, masks, or statues. Previous approaches based on texture, motion, or physiological cues are sensitive to illumination changes and show limited robustness against spoofing attacks. Thermal imaging helps overcome these limitations by detecting heat emissions, naturally excluding spoof faces. This study proposes a hybrid approach that fuses the edge information of RGB images with corresponding thermal images using a custom ARISTOF dataset containing live and spoof faces. The fused images are first evaluated using the YOLOv8-Face model to compare face detection performance across RGB, thermal, and fused modalities. The results show that the proposed method enhances the face detection accuracy of thermal images. The fused images are subsequently used to train a YOLOv8-Face model for live and spoof classification, demonstrating that the proposed multimodal fusion effectively supports robust face liveness detection.</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>LALE: Lightweight-Transformer Architecture for Land-Cover Estimation</title>
  <link>https://arxiv.org/abs/2606.02092</link>
  <pubDate>Tue, 02 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.02092v1 Announce Type: new Abstract: Semantic segmentation of remote sensing imagery requires models that capture both global context and local detail under tight computational budgets. Prior work typically optimizes for one of these axes: attention for global context, convolution for local detail, or compactness for efficiency. While hybrid approaches aim to capture both, they require architectural changes and encoder backbones with computational overhead, limiting efficiency and performance. We present LALE (Lightweight-transformer Architecture for Land-cover Estimation), an end-to-end remote sensing image segmentation architecture that bifurcates its encoder by resolution: lightweight ConvMixer stages handle high-resolution local features, while transformer stages handle low-resolution global context, confining the quadratic cost of self-attention to deep, downsampled feature maps. An all-MLP multi-scale decoder, together with RMSNorm and StarReLU throughout, further reduces compute and parameter count. On the large-scale ARAS400k remote-sensing segmentation benchmark, LALE establishes a strong efficiency-performance trade-off against CNN, transformer, and hybrid baselines. Our smallest variant (just 1.6M parameters) reaches within 2.6 F1 points of the best baseline (UPerNet) while using 4.5x fewer parameters, 7x less storage, 17x fewer GMACs, and delivering 1.8x higher throughput.</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>Predicting the risk of colorectal anastomotic leak based on preoperative mapping of the blood supply of the bowel</title>
  <link>https://arxiv.org/abs/2606.02156</link>
  <pubDate>Tue, 02 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.02156v1 Announce Type: new Abstract: Anastomotic leak remains one of the most serious complications following colorectal cancer surgery, substantially affecting patient outcomes, recovery trajectories, and healthcare costs. Despite advances in imaging technology, current preoperative evaluation relies only on clinical assessment, a process that is subjective, error-prone, and highly dependent on individual expertise. To date, no validated CT-based method exists to predict anastomotic leak risk prior to surgery. This protocol paper outlines a comprehensive framework for developing and validating an AI-driven system for preoperative risk assessment using pre- and post-contrast CT imaging. The study describes the stages of data collection, ethical handling, and preprocessing of patient data in accordance with GDPR, image preprocessing, and the exploration of deep learning architectures designed to generate clinically interpretable outputs. Two integrated tools constitute the main deliverables of this workflow: 1) a risk assessment module, which quantifies the likelihood of leakage by analyzing vascular and tissue features in CT scans, and 2) a Content-Based Medical Image Retrieval (CBMIR) module, which identifies and displays similar historical cases to support evidence-based surgical decision making. The protocol requires close collaboration between hospitals and universities, and it demonstrates that such a system is technically feasible and clinically implementable within existing healthcare infrastructures. By following the proposed methodological stages and regulatory principles, other institutions can reproduce this workflow to develop analogous decision-support tools. Ultimately, this interdisciplinary framework aims to enhance surgical planning, reduce leak incidence, and contribute to a broader paradigm shift toward explainable, data-driven precision surgery.</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>Segmentation-Guided Spatial Indexing for Generalizable and Explainable Deepfake Detection</title>
  <link>https://arxiv.org/abs/2606.00098</link>
  <pubDate>Tue, 02 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.00098v1 Announce Type: cross Abstract: We introduce segmentation-guided spatial indexing for generalizable and explainable deepfake detection. The key idea reverses the standard design order: rather than pooling all facial tokens and classifying afterward, we first select semantically meaningful patch tokens, then pool only those. A frozen FaRL parser assigns each DINOv3 ViT-L/16 patch token a semantic label; non-target tokens are discarded; a linear probe classifies the retained region. This spatial indexing exploits DINOv3&#39;s patch-level spatial consistency, the same property that enables emergent segmentation, to present the probe with a purer regional subspace where manipulation-relevant evidence is less diluted by whole-face cues. Region attribution is structural: when the mouth model predicts fake, the decision used only mouth tokens, not an overlaid saliency map. On Celeb-DF v2, the mouth-indexed probe achieves AUC 0.905, outperforming LipForensics (+8.1 pp) and Xception (+16.9 pp), with no DINOv3 or FaRL fine-tuning and no target-domain data. Ablations isolate the mechanism: replacing regional selection with DINOv3&#39;s CLS token drops Celeb-DF v2 AUC by 26.4 pp; replacing DINOv3 with FaRL features drops it by 20.9 pp. Both DINOv3 representation and the spatial index are independently necessary; neither alone approaches the full system.</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>DeepIPCv3: Event-Aware Multi-Modal Sensor Fusion for Sudden Pedestrian Crossing Avoidance</title>
  <link>https://arxiv.org/abs/2606.01277</link>
  <pubDate>Tue, 02 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.01277v1 Announce Type: cross Abstract: Current end-to-end autonomous driving systems predominantly rely on frame-based sensors, which suffer from inherent perception latency and motion blur during highly dynamic encounters, specifically sudden pedestrian crossings. To address this critical safety vulnerability, we propose DeepIPCv3, a novel multi-modal autonomous navigation framework that synergizes the dense 3D spatial geometry of LiDAR point clouds with the microsecond-level asynchronous event streams of a Dynamic Vision Sensor (DVS). We introduce a Transformer-inspired cross-modal attention mechanism to dynamically correlate these distinct modalities, allowing the network to instantaneously prioritize high-speed dynamic updates without sacrificing structural scene awareness. The fused latent representations are then mapped to safe local waypoints and executable control commands via a hybrid policy network that blends heuristic trajectory tracking with direct neural predictions. Due to the severe physical risks associated with live testing of these sudden crossing scenarios, the framework is rigorously evaluated offline using a custom multi-modal dataset collected across both well-illuminated noon and challenging evening conditions. Extensive comparative and ablation studies demonstrate that DeepIPCv3 achieves state-of-the-art predictive performance. By effectively eliminating exposure failures and motion blur, the proposed LiDAR and DVS fusion yields the lowest trajectory and control command errors, enabling highly reactive, mathematically bounded evasive maneuvers regardless of ambient illumination. To support future research, we will release the code in our GitHub repository at https://github.com/oskarnatan/DeepIPCv3.</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>Leaf Spectral Reflectance Prediction Using Multi-Head Attention Neural Networks</title>
  <link>https://arxiv.org/abs/2606.01432</link>
  <pubDate>Tue, 02 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.01432v1 Announce Type: cross Abstract: Accurate modeling of leaf spectral reflectance from physiological and biochemical traits is essential for advancing remote sensing applications in plant science and precision agriculture. Widely used radiative transfer models, such as PROSPECT-PRO, rely on generalized trait-reflectance relationships developed from a wide range of species, which may not fully capture the spectral behavior of specific crops like grapevines. In this study, we developed a trait-to-spectra prediction model using a multi-head attention neural network trained on a grapevine-specific dataset that includes 16 leaf traits measured across multiple varieties, growth stages, and years. The model was evaluated using stratified 5-fold cross-validation and achieved an average coefficient of determination (R^2) of 0.84 and normalized root mean squared error (NRMSE) of 1.52 percent, demonstrating high accuracy and generalizability. When compared to PROSPECT-PRO in forward mode, the neural network exhibited lower mean absolute error (MAE), especially in the near-infrared (NIR) and shortwave-infrared (SWIR) regions. These results emphasize the importance of species-specific modeling approaches and show that integrating biochemical and structural traits into data-driven architectures can significantly improve spectral prediction. The proposed model provides a robust framework for generating accurate leaf-level reflectance data, with potential applications in canopy trait retrieval, vineyard monitoring, and remote sensing-driven crop management.</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>Hist2Style: Histogram-Guided Stylization with Bilateral Grids</title>
  <link>https://arxiv.org/abs/2606.01819</link>
  <pubDate>Tue, 02 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.01819v1 Announce Type: cross Abstract: Photorealistic style transfer aims to match the color and tone of an input image to that of a style target while preserving the content and details of the original scene. Although existing large image models can facilitate these kinds of appearance edits, their high computational demands, potential for hallucinations, and limited user control make them unsuitable for high-resolution, real-time workflows. We introduce Hist2Style, a bilateral-grid formulation for fast, edge-aware stylization that preserves visual fidelity by constraining operations to locally affine transforms in bilateral space. Our model distills a large image editing model into a lightweight network by training on a large supervised corpus generated with language and vision-language models, targeting spatially varying color edits. The network conditions on a histogram-based embedding of the style target to provide an interpretable interface for adjusting the output style by modifying the target color distribution. Overall, Hist2Style maintains content structure by construction, avoids hallucinations, and supports real-time, high-resolution photorealistic stylization with interactive user-controllable color and tone adjustments.</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>Towards 3D-Aware Video Diffusion Models: Render-Free Human Motion Control with Mesh Tokenization</title>
  <link>https://arxiv.org/abs/2606.02000</link>
  <pubDate>Tue, 02 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.02000v1 Announce Type: cross Abstract: Diffusion models have shown remarkable success in video generation. However, whether such models are truly aware of the 3D structure underlying visual observations, rather than simply reproducing plausible 2D projections, remains an open question. In this work, we investigate this question through human motion control, a task that requires precise modelling of 3D human geometry, motion, camera viewpoint, and scene context. Unlike prior methods that rely on rendered 2D motion guidance videos, we propose a render-free framework that conditions video generation directly on compressed 3D human mesh tokens. This representation preserves full 3D geometric information while enabling a unified token-based generation pipeline that processes video tokens jointly with motion tokens in a DiT-based architecture. This design requires the model to reason jointly about appearance, 3D structure, and camera viewpoint during video generation. Experimental results demonstrate strong performance on human motion control benchmarks, while reducing artifacts induced by view-dependent 2D guidance and trajectory-pose mismatches during editing. These findings suggest that video diffusion models, when equipped with mesh tokenization, can better capture complex 3D human structures and their interactions with the surrounding environment.</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>RankByGene: Gene-Guided Histopathology Representation Learning Through Cross-Modal Ranking Consistency</title>
  <link>https://arxiv.org/abs/2411.15076</link>
  <pubDate>Tue, 02 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2411.15076v3 Announce Type: replace Abstract: Spatial transcriptomics (ST) provides essential spatial context by mapping gene expression within tissue, enabling detailed study of cellular heterogeneity and tissue organization. However, aligning ST data with histology images poses challenges due to inherent spatial distortions and modality-specific variations. Existing methods largely rely on direct alignment, which often fails to capture complex cross-modal relationships. To address these limitations, we propose a novel framework that aligns gene and image features using a ranking-based alignment loss, preserving relative similarity across modalities and enabling robust multi-scale alignment. To further enhance the alignment&#39;s stability, we employ self-supervised knowledge distillation with a teacher-student network architecture, effectively mitigating disruptions from high dimensionality, sparsity, and noise in gene expression data. Extensive experiments on seven public datasets that encompass gene expression prediction, slide-level classification, and survival analysis demonstrate the efficacy of our method, showing improved alignment and predictive performance over existing methods.</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>Towards Networked One Search Agent Systems: Multilateration of WiFi Fine Time Measurement Responders Using GNSS References</title>
  <link>https://arxiv.org/abs/2606.00075</link>
  <pubDate>Tue, 02 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.00075v1 Announce Type: new Abstract: This paper presents a proof-of-concept system for localising ground-based WiFi access points, acting as IEEE 802.11mc Fine Time Measurement (FTM) responders, from an uncrewed aerial vehicle using FTM ranging and Global Navigation Satellite System (GNSS)-referenced moving-baseline multilateration. Each associated GNSS-referenced FTM-initiator pose supplies a known reference point, turning the flight trajectory into a temporal multilateration problem. The real-time smartphone pipeline performs GNSS-ranging time association, robust outlier gating, a two-stage Gauss-Newton bootstrap, and sequential Bayesian filtering with bias tracking. Six measurement-noise configurations, including empirical and adaptive models, are evaluated on field data collected in unstructured, mountainous terrain. For a line-of-sight access point with 455 ranging measurements, the online Android pipeline achieves a final horizontal error of 4.4 m, while offline replay of the same flight yields a time-weighted mean horizontal error of 4.7 m and a best-case final horizontal error of 1.1 m under the best noise model after a close flyby. For non-line-of-sight targets, the real-time pipeline does not converge because of limited measurement availability, weak geometry, and signal attenuation, although an offline robust least-squares solver recovers a coarse solution for the vegetation-only case. The system is intended as a building block for Networked One Search Agent architectures, and preliminary middleware tests demonstrate software-level interoperability, while quantitative multi-agent accuracy is left for future work.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>A Methodological Framework for Explicit Control of the Speed-Accuracy Trade-off in Brain-Computer Interfaces</title>
  <link>https://arxiv.org/abs/2606.00106</link>
  <pubDate>Tue, 02 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.00106v1 Announce Type: new Abstract: Brain-computer interfaces (BCIs) are limited by low signal-to-noise ratio in modalities such as electroencephalography, which requires multiple trials to reliably decode user intentions. This induces a speed-accuracy trade-off, whereby higher accuracy comes at the cost of speed. The speed-accuracy balance is application-dependent, motivating controllable trade-offs. Conventional metrics, such as the Information Transfer Rate, combine speed and accuracy, obscuring their dependence and potentially introducing biases. In this study, we propose an evaluation framework independent of classifier, paradigm, and early-stopping strategy that separates speed and accuracy. We employ two measures, Gain (relative speed improvement) and Conservation (relative accuracy preservation), and combine them into a tunable Gain-Cons Balance controlled by α, regulating the speed-accuracy trade-off. The parameter adjusts the operating point without modifying the classifier, facilitating deployment across scenarios. The framework was evaluated on P300 event-related potential paradigms using public recordings from 63 subjects as well as multiple classifiers and early-stopping strategies to achieve distinct operating points in speed-accuracy and bitrate. Results show that tuning α yields fast, accurate, or balanced BCI behaviours, demonstrating explicit control of the speed-accuracy trade-off. The method supports subject-level performance prediction and improves explainability of BCI behaviour. Further analysis of the Information Transfer Rate reveals a systematic bias toward speed, explained by the proposed framework through the Gain and Conservation measurements. Overall, this work establishes the speed-accuracy trade-off as a controllable design variable validated on public P300-based paradigms, enabling transparent evaluation and application-specific optimization of BCIs.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Motif-based morphology signatures for interpretable ECG screening and monitoring</title>
  <link>https://arxiv.org/abs/2606.00107</link>
  <pubDate>Tue, 02 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.00107v1 Announce Type: new Abstract: Electrocardiography (ECG) remains central to cardiovascular screening, yet interpretation remains largely manual and episodic. Clinical practice relies on brief resting ECGs and, when required, long-duration ambulatory recordings, both generating data that require resource-intensive review. Consequently, subtle morphological changes or progressive drift preceding clinically apparent abnormalities may go unnoticed. We propose a motif-based framework that defines beat-aligned ECG motifs as interpretable cardiac signatures and quantifies morphological drift and deviation across short and long-term monitoring. Motifs are representative cardiac cycles capturing dominant morphology. We introduce three interpretable drift metrics: deviation from a normal sinus rhythm (NSR), deviation from a personalised baseline, and a motif instability index. Motifs are extracted by selecting beats that minimise Dynamic Time Warping (DTW) distance within fixed windows. We evaluate these metrics on short (PTB-XL) and long-duration (MIT-BIH Arrhythmia) ECG datasets. Interpretability is achieved through representative motif overlays and fiducial-based visualisations, enabling direct inspection of morphological changes. In MIT-BIH, the proposed metrics significantly separated predominantly normal from arrhythmic subjects (p&lt;0.01). In PTB-XL, NSR deviation distinguished normal from abnormal ECGs across major diagnostic subtypes (p&lt;1e-4, Cliff&#39;s delta up to 0.93). ECG motifs provide an interpretable representation of cardiac morphology, supporting scalable longitudinal monitoring and early detection of morphology-driven change.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Project SPARROW and the Future of Conservation Technology</title>
  <link>https://arxiv.org/abs/2606.00108</link>
  <pubDate>Tue, 02 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.00108v1 Announce Type: new Abstract: Global biodiversity is declining at unprecedented rates, yet the tools available to monitor and protect ecosystems remain limited by constraints in power, connectivity, and accessibility. We present SPARROW, a hardware and software open-source platform that integrates solar energy, edge artificial intelligence, and satellite communication to enable continuous, autonomous biodiversity monitoring in remote environments. Each SPARROW node combines a low-power Graphics Processing Unit (GPU) with modular visual, acoustic, and environmental sensors, performing on-device deep learning inference and transmitting summarized results through Low-Earth-Orbit (LEO) satellite or Global System for Mobile Communications (GSM) networks. We deployed SPARROW across tropical, temperate, and montane ecosystems in Colombia, Peru, Tanzania, and the United States, where it sustained 24/7 operation under variable environmental conditions and collected more than two million images and acoustic recordings in the first 190 days. The system demonstrated robust real-time classification and adaptive power management, achieving full autonomy without on-site human intervention. By integrating renewable energy, on-edge AI, and open-source design, SPARROW lowers the technical and financial barriers to ecological monitoring and establishes a scalable foundation for a distributed, intelligent network of sensors, an emerging &quot;Internet of Living Things&quot; for planetary biodiversity monitoring.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>SpikeWFM: Spiking-Aided Wireless Foundation Model for Robust Channel Prediction</title>
  <link>https://arxiv.org/abs/2606.00120</link>
  <pubDate>Tue, 02 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.00120v1 Announce Type: new Abstract: This paper proposes SpikeWFM, a novel hybrid architecture that integrates spiking neural networks (SNNs) with conventional artificial neural network (ANN)-based transformers for wireless foundation models (WFMs). Inspired by the noise-robust and energy-efficient information processing in the human brain, SpikeWFM aims to enhance the resilience of WFMs against noise and interference while maintaining strong generalization capabilities across diverse wireless scenarios. Drawing from the success of large language models, WFMs leverage self-supervised pre-training on large-scale datasets spanning various wireless environments to learn a unified embedding that supports a wide range of downstream tasks, including channel prediction, channel estimation, beam prediction, and positioning. Such models typically outperform task-specific designs and exhibit superior adaptability to unseen conditions. However, existing WFMs remain vulnerable to realistic noise and interference in practical wireless systems. To address this limitation, we incorporate spiking neurons into the transformer-based WFM architecture. We provide a brief theoretical analysis demonstrating how the SNN-ANN hybrid effectively mitigates noise and interference through temporal sparsity and event-driven processing. Experimental results show that SpikeWFM consistently outperforms conventional ANN-based WFMs in both pre-training convergence and channel prediction accuracy. Additional results on communication and sensing tasks will be presented in the full journal version of this work.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>ReFLEX: Length-Generalizable CSI Denoising for MIMO-OFDM via Relative-Frequency Bias</title>
  <link>https://arxiv.org/abs/2606.00263</link>
  <pubDate>Tue, 02 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.00263v1 Announce Type: new Abstract: This letter studies CSI denoising for MIMO-OFDM with variable NR resource block (RB) allocations. ReFLEX is a length-generalizable Transformer whose frequency attention uses a relative-frequency position bias (RFPB) generated from subcarrier offsets. A single checkpoint handles unseen RB lengths and can be applied to sparse DM-RS observations in the tested RB5/RB10 PUSCH setup without retraining. In a 3GPP TR 38.901 UMa NLOS channel, ReFLEX achieves about -9.6 dB NMSE on unseen RB lengths. In NR PUSCH/UL-SCH simulations, ReFLEX denoising followed by time-frequency interpolation reduces the 10% BLER threshold by about 2-3 dB.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Radar-Assisted Beam Management Framework for mmWave NTNs: Overhead Reduction and Physical Layer Security Application</title>
  <link>https://arxiv.org/abs/2606.00277</link>
  <pubDate>Tue, 02 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.00277v1 Announce Type: new Abstract: Fast and low-overhead beam management is a critical requirement for the practical deployment of non-terrestrial networks (NTNs) operating at millimeter-wave and higher frequencies. In this paper, we propose a radar-assisted beam selection framework for NTNs that limits the set of candidate beams by utilizing spatial sensing information such as the angle-of-departure (AoD) and distance estimations. To provide theoretical insight into the expected worst-case overhead, we conduct a probabilistic analysis under idealized conditions, where an approximation of the worst-case beam selection overhead is proposed and its statistics are derived under Gaussian error. Additionally, the proposed framework is applied to a physical-layer security (PLS) scenario by leveraging the radar&#39;s capability to detect passive targets that represent unintended users. The simulation results show that the unintended user&#39;s power is suppressed below -135 dBm, while an additional beamforming gain of roughly 2 dB is attained for the legitimate users.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Ambiguity Analysis and Design of Sparse Arrays via Generalized Vandermonde Rank Conditions</title>
  <link>https://arxiv.org/abs/2606.00360</link>
  <pubDate>Tue, 02 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.00360v1 Announce Type: new Abstract: Sparse linear arrays obtained by thinning a uniform linear array (ULA) achieve large effective apertures with a reduced number of physical sensors and have become a key enabling technology across radar, sonar, communications, and integrated sensing and communications. The price of thinning, however, is the emergence of ambiguities in the array manifold: distinct sets of directions of arrival that produce identical sensor measurements, precluding unique identification of multiple sources. Conventional sparse-array design criteria, based on beampattern shaping or estimation-performance optimization, do not fully capture how multiple steering vectors interact jointly to produce such ambiguities. This paper develops a scalable algebraic framework for the multi-source identifiability analysis of thinned ULAs. By relating the rank deficiency of the generalized Vandermonde matrix associated with the sparse steering matrix to that of a thinned Toeplitz matrix, and further to a rank condition on an augmented full-ULA steering matrix with prescribed generators, we obtain a systematic characterization of the ambiguity sets in large sparse arrays together with constructive design guidelines for ambiguity-free geometries. Algebraic and numerical examples demonstrate that the proposed framework characterizes ambiguity sets at scales well beyond the practical reach of previous sparse-array design and synthesis methods.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Channel Estimation for Movable Intelligent Surface</title>
  <link>https://arxiv.org/abs/2606.00387</link>
  <pubDate>Tue, 02 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.00387v1 Announce Type: new Abstract: This paper proposes a tensor-based channel estimation framework for an uplink MIMO system assisted by a movable intelligent surface. The considered architecture combines a fixed transmissive metasurface with a smaller movable layer, whose discrete positions create an additional structured training dimension. By jointly exploiting fixed-layer phase patterns and movable-layer positions, the received pilots are modeled as a fourth-order PARAFAC tensor. A trilinear alternating least-squares receiver is then derived to estimate the individual channels and the position-dependent response. Importantly, the proposed method does not require prior knowledge of the movable-layer phase response at the receiver, since this unknown factor is estimated from the tensor structure of the received signal. Simulation results show that increasing the training length improves the NMSE of the estimated factors and the reconstructed cascaded channel.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Joint Channel and Symbol Estimation for RIS-Assisted Fluid Antenna Systems</title>
  <link>https://arxiv.org/abs/2606.00423</link>
  <pubDate>Tue, 02 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.00423v1 Announce Type: new Abstract: This paper addresses joint channel and symbol estimation in reconfigurable intelligent surface (RIS)-aided multiuser uplink systems with fluid antennas (FAs) at the base station. We propose the Nested Tucker for Fluid Antenna Systems (NTFAS) protocol, in which FA port selection and user-dependent coding vary across blocks while the transmitted symbol matrix is shared across observations. This structure yields coupled Tucker models with common channel and data factors. A two-stage semi-blind bilinear alternating least squares (BALS) receiver is then developed to estimate the cascaded channel and symbols, and to separate the user-to-RIS and RIS-to-BS channels through the embedded PARAFAC structure. Simulations show that NTFAS improves cascaded-channel NMSE and spectral efficiency (SE) with respect to a competing semi-blind benchmark, while maintaining comparable BER performance.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Performance of DF Multihop Networks with TAS/GSC over Nakagami-m Fading Channels</title>
  <link>https://arxiv.org/abs/2606.00598</link>
  <pubDate>Tue, 02 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.00598v1 Announce Type: new Abstract: In this work, transmit antenna selection (TAS) with generalized selection combining (GSC), i.e., TAS/GSC, is revisited over independent identically distributed Nakagami-m flat fading channels, with simple, newly derived closed-form expressions for outage probability (OP), symbol error rate (SER), and ergodic capacity. Compared with their multinomial theorem-based counterparts for GSC and TAS/GSC, our derivations are more intelligible, practical, and simple, which facilitates TAS/GSC implementations in various fields. As an example, a performance analysis of decode-and-forward multihop networks with TAS/GSC in each hop is presented over independent non-identically distributed Nakagami-m fading channels, with closed-form expressions for OP, SER, and ergodic capacity. Finally, all derived analytical expressions are validated via Monte Carlo simulation.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Bending beams behind corners: mechanisms, challenges and capabilities for wireless connectivity</title>
  <link>https://arxiv.org/abs/2606.00678</link>
  <pubDate>Tue, 02 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.00678v1 Announce Type: new Abstract: Curved beams, that is, beams that are able to propagate on nonlinear trajectories, are often envisioned as ideal candidates for blockage avoidance in future wireless connectivity. Owing to this unique feature, they are considered as ideal beams for bending around and behind corners to reach users beyond the line-of-sight (LoS), thus offering unprecedented connectivity. In this work, we explain the various mechanisms of beam propagation beyond the LoS, and we demonstrate that beam bending behind corners results from an interplay between wavefront engineering and edge diffraction, with distinct characteristics that depend on the extent of blockage and the beam formation efficiency. We identify three distinct regimes of operation, namely the unblocked, the partially blocked, and the fully blocked regime, and we show that beam bending through wavefront engineering dominates in the unblocked and partially blocked regimes, while edge diffraction dominates in the fully blocked regime; as a result, curved beams cannot really bend behind the corner, unless there is some LoS between the user and the transmitter. Based on our findings, we compare curved beams with focused beams, and we demonstrate that they perform similarly in the partially blocked regime, while focused beams outperform curved beams in the unblocked and fully blocked regimes.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Rethinking NB-IoT Downlink Synchronization for LEO-NTN: A Novel Overhead Reduction Method and Measurement-Based Evaluation</title>
  <link>https://arxiv.org/abs/2606.00701</link>
  <pubDate>Tue, 02 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.00701v1 Announce Type: new Abstract: Narrowband Internet of Things (NB-IoT) over non-terrestrial networks (NTN) is a key enabler for massive Internet of Things (IoT) in 6G, but in low Earth orbit (LEO) scenarios, large and time-varying Doppler shifts generate carrier frequency offset (CFO) beyond the correction range of standard user equipment (UE), making initial downlink synchronization a major bottleneck. This paper analyzes Doppler characteristics in realistic NB-IoT LEO scenarios, reviews Doppler mitigation strategies, and proposes a standard-compliant, low-overhead search-space optimization method for downlink acquisition. Results under realistic LEO conditions with real-time measurements show reduced acquisition overhead while maintaining synchronization reliability, supporting NB-IoT adaptation to 6G NTN deployment.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Toward Agile and Cooperative LEO Satellite Beam-Hopping Networks: Paradigms, Challenges, and Opportunities</title>
  <link>https://arxiv.org/abs/2606.00743</link>
  <pubDate>Tue, 02 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.00743v1 Announce Type: new Abstract: Low-Earth orbit (LEO) satellite beam-hopping (BH) technology is emerging as a promising approach to meet the ever-increasing global connectivity demands, enabling agile, on-demand coverage. LEO satellite BH can address the spatio-temporal non-uniformity of ground user traffic by dynamically allocating capacity and optimizing network performance. Cooperative multi-satellite BH enables joint transmission and interference avoidance to improve received signal quality. This article provides a comprehensive paradigm of BH, detailing its key dimensions, strategies, and architectures. Through exploration of key challenges, including beam pattern design, on-demand scheduling, and interference management, this paper identifies the potential applications of BH, ranging from adaptive capacity allocation for hotspot areas, low-power Internet-of-Things (IoT), delay-sensitive services, to massive connectivity support. Furthermore, a system-level analysis is presented, including key metrics, models of inter-beam and inter-satellite interference, and cooperative joint transmission, and a case study is provided to demonstrate the performance benefits of BH with cooperative transmission. Several promising future research directions are discussed to guide the future development of LEO satellite BH networks.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Optimal Routing and Link Configuration for Covert Heterogeneous Wireless Networks in the Presence of a Friendly Jammer</title>
  <link>https://arxiv.org/abs/2606.00848</link>
  <pubDate>Tue, 02 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.00848v1 Announce Type: new Abstract: In modern radio networks, nodes frequently access multiple communication interfaces such as WiFi, cellular, LoRa, and Zigbee. Optimal utilization of such heterogeneous networks (HetNets) at link and network levels is essential for ensuring efficient and secure communication. Some applications require a high level of security, requiring the signal to be completely undetectable. Previous works have considered such covertness, but it often results in limited achievable rates. Physical layer analysis shows that friendly jamming can significantly improve covert data rates, motivating its incorporation into HetNets. Here, we analyze a scenario where a jammer assists communication in a HetNet in the presence of an adversary attempting signal detection. We first optimize the physical layer (PHY) for a single link and then incorporate those results into an optimal routing and link configuration approach that accounts for an adversary observing the aggregate signals from all links. Numerical results demonstrate significant performance gains when compared to alternative approaches. In fact, the rate observed for the proposed approach is high enough to question the optimality of the low rate design approach employed; we address this concern through revised algorithms and characterize their performance.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>TenSIM: Tensor-Based Channel Estimation for MIMO Systems with Stacked Intelligent Metasurfaces</title>
  <link>https://arxiv.org/abs/2606.00917</link>
  <pubDate>Tue, 02 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.00917v1 Announce Type: new Abstract: Stacked intelligent metasurfaces (SIMs) are emerging as a promising architecture for the sixth generation (6G) and beyond of wireless systems, enabling richer electromagnetic-wave manipulation than conventional single-layer metasurfaces. However, realizing these gains requires accurate and scalable channel estimation under the strong inter-layer coupling and multilinear parameter interactions introduced by the stacked programmable metasurface layers. This paper proposes TenSIM, a tensor-based channel-estimation framework for SIM-assisted multiple-input multiple-output (MIMO) systems. By exploiting a structured SIM training protocol, TenSIM derives two parity-dependent observation models: a PARAllel FACtor (PARAFAC) model for odd-layer SIMs and a Tucker model for even-layer SIMs. These formulations decouple the transmitter-SIM and SIM-receiver channels while explicitly accounting for inter-layer wave coupling. Based on the resulting tensor models, we develop alternating least squares estimators, establish identifiability conditions using the associated design matrices, and characterize practical sufficient conditions for full-column-rank training designs, including those involving scaling ambiguities. The proposed framework is validated through extensive numerical experiments and reveals the main operating trade-offs. We show that both TenSIM-PARAFAC and TenSIM-Tucker outperform unstructured least squares baselines by exploiting the tensor structure of the SIM cascade. Moreover, TenSIM-PARAFAC offers better scalability, lower computational complexity, and stronger robustness to inter-layer spacing, while TenSIM-Tucker can provide more accurate channel reconstruction when sufficient training and strong layer coupling are available. Finally, it is shown that the proposed TenSIM framework remains effective under imperfect or blind SIM training when additional pilot diversity is available.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Electromagnetic Digital Twin-Enabled Closed-Loop Beam Management in ISAC Systems</title>
  <link>https://arxiv.org/abs/2606.00977</link>
  <pubDate>Tue, 02 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.00977v1 Announce Type: new Abstract: Digital twin (DT) is envisioned as a key enabler of sixth-generation (6G) communication systems, evolving from offline descriptive replicas for monitoring and analysis to in-the-loop agents within digital twin networks (DTNs) that couple physical and digital worlds. Recent advances in integrated sensing and communication (ISAC)-driven electromagnetic (EM) scattering methods enable environment twinning by linking channel behaviors to EM properties of the scatterers, supporting interpretable DT states and EM-grounded optimization. However, existing studies primarily focus on DT construction and lack mechanisms for closed-loop control in wireless systems. Moreover, array-geometry mismatch can bias DT reconstruction and degrade control performance, while prior works assume known arrays. To address these gaps, we propose an EM-ISAC-based closed-loop DTN framework with a hierarchical design integrating environment twinning, prior injection, and control decision into an end-to-end loop. Leveraging ISAC measurements, the proposed framework jointly reconstructs scatterer information and array-dependent forward operator and employs a low-complexity Bayesian message-passing algorithm to perform contrast inference and array calibration. The reconstructed DT guides codebook preselection to reduce training overhead and narrow candidate beams. Subsequently, downlink beamforming (BF) is performed based on DT-predicted channels, enabling latency-bounded closed-loop control. Simulation results demonstrate improved robustness and control performance under array mismatch.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Multicast Capacity of XL-RIS Assisted Hybrid Near- and Far-Field mmWave Communications</title>
  <link>https://arxiv.org/abs/2606.01133</link>
  <pubDate>Tue, 02 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.01133v1 Announce Type: new Abstract: Multicast transmission in millimeter-wave (mmWave) networks is fundamentally limited by the weakest user, and blockages further exacerbate this problem. Large-scale reconfigurable intelligent surfaces (XL-RIS) offer a promising solution by providing high array gain to overcome blockages. However, the large aperture of XL-RIS significantly expands the near-field region, creating a hybrid-field scenario where some users lie in the near-field while others remain in the far-field. Existing hybrid-field studies on XL-RIS have primarily focused on channel estimation and deployment optimization, leaving multicast capacity analysis unexplored. This paper investigates the fundamental capacity limits of XL-RIS-assisted multicast communications in hybrid-field scenarios. For the fundamental two-user case consisting of one near-field and one far-field user, we derive the optimal closed-form covariance matrix and optimize the RIS phase shifts via manifold optimization. We establish that the multicast capacity scales as $\Theta(\log_2(MN))$ as the number of transmit antennas $M$ and/or RIS elements $N$ grow large, and prove this scaling is order-tight. Numerical results validate the bounds and show the impact of $M$, $N$, and distance on the multicast rate.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Flexible Rate-Splitting for Joint Unicast and Multi-Group Multicast Transmission in RIS-Assisted mmWave Networks</title>
  <link>https://arxiv.org/abs/2606.01144</link>
  <pubDate>Tue, 02 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.01144v1 Announce Type: new Abstract: Joint unicast and multi-group multicast transmission with RIS and RSMA is a promising enabler for 6G services. However, existing RSMA schemes for such scenarios split only unicast messages while leaving multicast messages intact, limiting the degree of freedom of interference management. To this end, we propose a joint rate splitting framework that splits both unicast and multicast information and two RSMA schemes. The common-common fusion (CCF-RSMA) scheme encodes the unicast common part into the global multicast common stream, while the private-common fusion (PCF-RSMA) scheme merges it with the group-specific multicast private part. For each scheme, we formulate energy efficiency (EE) maximization problems under both perfect and imperfect channel state information, and jointly optimize active beamforming, RIS phase shifts and rate allocation parameters. Simulation results demonstrate that the proposed schemes significantly outperform the comparative schemes in terms of EE, thereby proving the effectiveness of the proposed framework. Moreover, CCF-RSMA is more favorable in scenarios with larger groups and higher unicast QoS demands, whereas PCF-RSMA is better suited for scenarios with smaller groups and higher multicast QoS.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>MAC Performance and Algorithmic Optimization in Matrix Multiplication Workloads</title>
  <link>https://arxiv.org/abs/2606.01174</link>
  <pubDate>Tue, 02 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.01174v1 Announce Type: new Abstract: Matrix multiplication is a fundamental computational kernel underlying a wide range of real-world applications, including machine learning, scientific computing, signal processing, and computer graphics. Its performance directly impacts the efficiency, scalability, and energy consumption of modern computing systems. This paper presents a comparative analysis of several matrix multiplication algorithms implemented in software and examined in the context of their hardware execution characteristics. Naive, NumPy, Strassen, and Winograd algorithms are evaluated based on execution time, user time, and CPU time across increasing matrix sizes. The performance metrics reveal computational bottlenecks and highlight the benefits of algorithmic optimizations. Furthermore, the study investigates the mathematical operations underlying each algorithm and analyzes how matrix dimensions influence MAC (Multiply-Accumulate) behavior and overall computational efficiency in the hardware domain. The results provide a performance benchmark and contribute to understanding how algorithmic choices interact with modern computing architectures for applications in computer architecture, data science, and real-time embedded systems.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Invascal: Inverse-Vacuity Self-Calibration for Uncertainty-Aware LiDAR Range-View Semantic Segmentation</title>
  <link>https://arxiv.org/abs/2606.00069</link>
  <pubDate>Tue, 02 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2606.00069v1 Announce Type: cross Abstract: LiDAR semantic segmentation is a core perception capability for autonomous vehicles and mobile robots. However, safe operation also depends on knowing when predictions are unreliable. Existing approaches typically rely on softmax confidence, which is often miscalibrated and overconfident, while stronger uncertainty estimates from Monte Carlo dropout or ensembles are often computationally expensive for real-time use. To this end, we introduce a novel, architecture-agnostic uncertainty-aware Adapter Head. It decomposes the prediction into a Preference Head for class ranking and a Strength Head that refines uncertainty assessment, thereby enabling a principled construction of evidential Dirichlet representations. Building on this design, we propose our inverse-vacuity self-calibration objective (Invascal), which directly supervises the strength signal to produce reliable and well-calibrated uncertainty estimates while preventing runaway evidence growth. We evaluate our framework across multiple LiDAR datasets and backbone architectures. We compare against deterministic training, Monte Carlo dropout and ensembles, and prior evidential methods. Our approach consistently improves uncertainty calibration over traditional deterministic methods with minimal computational overhead. At the same time, it preserves competitive segmentation accuracy, where prior evidential methods often suffer performance degradation.</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>FiPA-SR -- FiLM-Conditioned Perceptually Informed Audio Super-Resolution</title>
  <link>https://arxiv.org/abs/2605.30594</link>
  <pubDate>Mon, 01 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2605.30594v1 Announce Type: new Abstract: Audio bandwidth extension aims to reconstruct missing high-frequency content from bandlimited signals. This paper proposes FiPA-SR, a GAN-based perceptual architecture capable of handling different input bandwidths within a single model. Building upon the previous $\textrm{AEROMamba}_\textrm{P}$ framework, the proposed model incorporates FiLM layers to adapt the reconstruction process according to the respective bandwidth. Experiments on the MUSDB dataset show that FiPA-SR outperforms the state-of-the-art AudioSR model across 8, 20, and 32 kHz input sampling rates. Moreover, the proposed architecture uses approximately 3$\times$ less GPU memory and performs inference more than 60$\times$ faster than the diffusion-based baseline.</description>
  <dc:source>Systems/eess.AS_(Audio_and_Speech_Processing)</dc:source>
</item>
<item>
  <title>OpenSTBench: Beyond Semantic Evaluation for Speech Translation</title>
  <link>https://arxiv.org/abs/2605.30792</link>
  <pubDate>Mon, 01 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2605.30792v1 Announce Type: new Abstract: Speech translation systems increasingly span speech-to-text translation (S2TT), speech-to-speech translation (S2ST), offline translation, and streaming generation, producing outputs that differ in modality, speech realization, and timing behavior. Existing evaluation practices assess important aspects such as translation quality, speech quality, and temporal quality, but these aspects are often evaluated under separate protocols, making it difficult to compare heterogeneous systems comprehensively. To address this gap, we present OpenSTBench, a unified multidimensional evaluation framework that organizes heterogeneous speech translation outputs into a shared evaluation format. OpenSTBench supports both S2TT and S2ST systems in offline and streaming settings, and jointly evaluates translation quality, speech quality, speaker preservation, emotion and paralinguistic fidelity, temporal consistency, and latency. Through experiments on representative speech translation systems, we show that systems with strong translation quality can still differ substantially in speech quality, as well as in temporal quality. OpenSTBench provides a reproducible protocol for analyzing these cross-dimensional differences and supporting application-oriented comparison of speech translation systems. The code and datasets are available at https://github.com/sjtuayj/OpenSTBench.</description>
  <dc:source>Systems/eess.AS_(Audio_and_Speech_Processing)</dc:source>
</item>
<item>
  <title>ImmersiveTTS: Environment-Aware Text-to-Speech with Multimodal Diffusion Transformer and Domain-Specific Representation Alignment</title>
  <link>https://arxiv.org/abs/2605.30965</link>
  <pubDate>Mon, 01 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2605.30965v1 Announce Type: new Abstract: Recent advancements in text-guided audio generation have yielded promising results in diverse domains, including sound effects, speech, and music. However, jointly generating speech with environmental audio remains challenging due to the inherent disparities in their acoustic patterns and temporal dynamics. We propose ImmersiveTTS, an environment-aware text-to-speech (TTS) model that generates natural speech seamlessly integrated within environmental contexts by explicitly modeling cross-modal interactions. Our model builds on a multimodal diffusion transformer and fuses transcript-aligned speech latent with text-conditioned environmental context via joint attention. To enhance semantic consistency, we introduce a domain-specific representation alignment objective tailored to environment-aware TTS, leveraging complementary self-supervised representations from speech and audio encoders. Experimental results show that ImmersiveTTS achieves higher naturalness, intelligibility, and audio fidelity than existing approaches across objective metrics and human listening tests.</description>
  <dc:source>Systems/eess.AS_(Audio_and_Speech_Processing)</dc:source>
</item>
<item>
  <title>SwanVoice: Expressive Long-Form Zero-Shot Speech Synthesis for Both Monologue and Dialogue</title>
  <link>https://arxiv.org/abs/2605.30993</link>
  <pubDate>Mon, 01 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2605.30993v1 Announce Type: new Abstract: Zero-shot text-to-speech (TTS) has improved substantially for single-speaker synthesis, yet expressive long-form multi-speaker dialogue remains difficult. A common workaround is to synthesize each turn with a monologue TTS model and stitch the outputs together. This adds inference cost and often breaks acoustic consistency, conversational coherence, and affective continuity across turns. Recent dialogue TTS systems have begun to address this setting, but they still struggle to keep expressive coherence, controllable speaker switching, and monologue quality at the same time. We present SwanData-Speech and SwanVoice. SwanData-Speech builds monologue and dialogue corpora from in-the-wild audio, using Swan Forced Aligner for pause-aware word-level alignment and RobustMegaTTS3 for pronunciation-hard cases. Built on these data, SwanVoice is a zero-shot TTS model for 1--4 speakers, combining a 25 Hz VAE, raw-text conditioning with pause-aware symbols and pinyin substitution, and a flow-matching DiT with speaker-turn conditioning. Training starts from monologue speech, moves through mixed and real dialogue data, and then uses DiffusionNFT post-training with phone-level and speaker-similarity rewards. On SwanBench-Speech, SwanVoice obtains higher richness and hierarchy scores than all evaluated open-source baselines in both monologue and dialogue settings, while content accuracy remains the main limitation. Audio demos are available at https://swanaigc.github.io//#swanvoice.</description>
  <dc:source>Systems/eess.AS_(Audio_and_Speech_Processing)</dc:source>
</item>
<item>
  <title>On the Use of Dereverberation for Acoustic Feedback Cancellation</title>
  <link>https://arxiv.org/abs/2605.31101</link>
  <pubDate>Mon, 01 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2605.31101v1 Announce Type: new Abstract: In public address systems and hearing aids, the maximally achievable amplification or gain is limited by acoustic feedback. Therefore, in order to be able to apply a higher gain, feedback cancellation methods are required. In addition, it is oftentimes also desirable to dereverberate a recorded signal, that is, remove the late reverberation component of the signal, before playing it back. In this paper, it is shown that under two mild conditions, the acoustic feedback signal can be written as a reverberant version of the source signal. Therefore, it is possible to treat the joint dereverberation and acoustic feedback cancellation problem as a dereverberation-only problem, meaning that dereverberation algorithms can be applied to the joint problem. Simulations corroborate this finding.</description>
  <dc:source>Systems/eess.AS_(Audio_and_Speech_Processing)</dc:source>
</item>
<item>
  <title>Improving acoustic drone detection generalization through pretraining and data augmentation</title>
  <link>https://arxiv.org/abs/2605.31329</link>
  <pubDate>Mon, 01 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2605.31329v1 Announce Type: new Abstract: Detecting unauthorized UAV flights is critical for surveillance, security, and airspace management. Acoustic drone detection, which relies on the distinctive propeller and motor sounds of UAVs, provides a low-cost, passive solution that requires no line of sight. A central challenge is generalization: reliably distinguishing drone signatures from ambient noise across unseen recording setups, environments, and UAV types (out-of-domain). Inspired by advances in large-scale audio pretraining, we develop a compact DNN-based detector and improve its generalization by (1) pretraining the model for broad sound-event classification before fine-tuning on diverse in-house and public drone recordings, and (2) applying on-the-fly augmentations (pitch shifting, noise mixing, microphone transfer function simulation, spectrogram augmentation) to expose the model to varied acoustic conditions. An ablation study quantifies the impact of each augmentation. For evaluation, we set target false-positive rates (FPR) aligned with real-world surveillance needs and report true-positive rates (TPR) on both in-domain data (public IDMT Berne 2022) and out-of-domain data (public AuDroK). Our results show that pretraining is the dominant factor for robust detection, yielding substantial TPR improvements over training from scratch on all benchmarks. The full augmentation chain provides additional gains on acoustically mismatched out-of-domain data, achieving the best mean TPR on the AuDroK subsets and the largest improvements on the most challenging scenarios. We further validate real-world applicability by measuring false positives on public non-drone corpora (IDMT-TRAFFIC and ESC-50), demonstrating equally low FPR on unfamiliar backgrounds. A distance-dependent analysis on IDMT Berne 2022 shows effective detection at distances up to 150 m.</description>
  <dc:source>Systems/eess.AS_(Audio_and_Speech_Processing)</dc:source>
</item>
<item>
  <title>Beyond Hearing: Learning Task-Agnostic ExG Representations from Earphones via Physiology-Informed Tokenization</title>
  <link>https://arxiv.org/abs/2510.20853</link>
  <pubDate>Mon, 01 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2510.20853v2 Announce Type: replace Abstract: Electrophysiological (ExG) signals offer valuable insights into human physiology, yet building foundation models that generalize across everyday tasks remains challenging due to two key limitations: (i) insufficient data diversity, as most ExG recordings are collected in controlled labs with bulky, expensive devices; and (ii) task-specific model designs that require tailored processing (i.e., targeted frequency filters) and architectures, which limit generalization across tasks. To address these challenges, we introduce an approach for scalable, task-agnostic ExG monitoring in the wild. We collected 50 hours of unobtrusive free-living ExG data with an earphone-based hardware prototype to narrow the data diversity gap. At the core of our approach is Physiology-informed Multi-band Tokenization (PiMT), which decomposes ExG signals into 12 physiology-informed tokens, followed by a reconstruction task to learn robust representations. This enables adaptive feature recognition across the full frequency spectrum while capturing task-relevant information. Experiments on our new DailySense dataset, the first to enable ExG-based analysis across five human senses, together with four public ExG benchmarks, demonstrate that PiMT consistently outperforms state-of-the-art methods across diverse tasks.</description>
  <dc:source>Systems/eess.AS_(Audio_and_Speech_Processing)</dc:source>
</item>
<item>
  <title>G-STAR: End-to-End Global Speaker-Tracking Attributed Recognition</title>
  <link>https://arxiv.org/abs/2603.10468</link>
  <pubDate>Mon, 01 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2603.10468v2 Announce Type: replace Abstract: We study timestamped speaker-attributed automatic speech recognition (SA-ASR) for long-form, multi-party speech with overlap. In this setting, chunk-wise inference must preserve meeting-level speaker identity consistency while producing time-stamped, speaker-labeled transcripts. Prior Speech-LLM systems tend to prioritize either local diarization or global labeling, lacking the ability to jointly model fine-grained temporal boundaries and robust cross-chunk identity linking. We propose G-STAR, an end-to-end framework that couples a cache-conditioned speaker-tracking module with a Speech-LLM transcription backbone. The tracker provides structured speaker cues with temporal grounding, and the LLM generates attributed text conditioned on these cues. G-STAR supports component-wise optimization and joint end-to-end training, enabling flexible learning under heterogeneous supervision and domain shift. Under chunk-wise decoding protocols, experiments on both oracle-segmented local evaluation and full-meeting global evaluation show strong speaker-attributed transcription performance.</description>
  <dc:source>Systems/eess.AS_(Audio_and_Speech_Processing)</dc:source>
</item>
<item>
  <title>Rethinking Continual Learning for Speech and Audio: A Representation-Centric Taxonomy and Open Problems</title>
  <link>https://arxiv.org/abs/2605.24863</link>
  <pubDate>Mon, 01 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2605.24863v2 Announce Type: replace Abstract: Speech and audio systems operate in inherently non-stationary environments, yet continual learning (CL) research in this domain, especially in the foundation model era, remains fragmented and fails to account for the coupled, geometry-sensitive nature of acoustic representations. Modern speech foundation models operate over highly entangled, continuous representations that jointly encode linguistic, speaker, and paralinguistic factors within a shared latent space. CL is therefore fundamentally about preserving and evolving shared representation structure rather than retaining isolated task knowledge. In this work, we revisit CL for speech from a representation-centered perspective, and introduce a new taxonomy that organizes CL according to how underlying representation geometry evolves under non-stationary acoustic conditions. We further identify key mismatches between current CL assumptions and speech foundation model behavior, and finally outline a set of open challenges and future research directions.</description>
  <dc:source>Systems/eess.AS_(Audio_and_Speech_Processing)</dc:source>
</item>
<item>
  <title>MoE-dqINR: A Unified Mixture-of-Experts Implicit Neural Representation Framework for Scan-Specific Dynamic and Quantitative MRI Reconstruction</title>
  <link>https://arxiv.org/abs/2605.31302</link>
  <pubDate>Mon, 01 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2605.31302v1 Announce Type: new Abstract: Undersampled magnetic resonance imaging (MRI) reconstruction seeks to recover temporally or contrast-varying image series from incomplete multicoil k-space data while preserving state-dependent fidelity for dynamic and quantitative MRI (qMRI). Existing scan-specific implicit neural representations (INRs) often use monolithic spatiotemporal coordinate fields, explicit subspaces, motion or deformation models, calibration variables, or sequence-specific quantitative signal models. These design choices can limit flexibility in sharing spatial information while adapting image synthesis across acquisition states. Moreover, many INR-based baselines remain computationally demanding, typically requiring per-scan optimization times on the order of hundreds to thousands of seconds. We propose MoE-dqINR, a scan-specific multicoil MRI reconstruction framework that factorizes the image-domain representation into shared spatial experts and a state-conditioned routing pathway. Spatial experts encode reusable coordinate-dependent image content, whereas routing weights, conditioned on ordered acquisition states, synthesize each dynamic frame or contrast state from a common expert bank. The representation is coupled to a multicoil MRI forward model and uses the normalized state index to drive routing in both dynamic and quantitative MRI. By separating shared spatial representation from state-dependent synthesis, the framework provides an image-first architecture for dynamic and quantitative MRI while reducing scan-specific INR optimization to approximately 30 s per scan in our experiments. The proposed formulation establishes state-conditioned mixture-of-experts INR as a scan-specific multicoil MRI reconstruction prior that unifies shared spatial representation, dynamic- and qMRI-specific synthesis, and practical per-scan efficiency.</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>Self-Tuning Regularization for Image Scanning Microscopy</title>
  <link>https://arxiv.org/abs/2605.31426</link>
  <pubDate>Mon, 01 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2605.31426v1 Announce Type: new Abstract: Image Scanning Microscopy (ISM) is a fluorescence imaging technique that combines detector-array acquisition and computational reconstruction to achieve the theoretical resolution of an ideal confocal microscope, i.e., one operating with an infinitesimally small pinhole, while maintaining high signal-to-noise ratio. Among the reconstruction methods for obtaining the super-resolved image, multi-image deconvolution (MID) and its extension aimed at preserving the optical sectioning capability of confocal microscopy, known as super-resolution sectioning ISM (s$^2$ISM), are among the most widely used approaches. Both methods rely on Richardson--Lucy-type iterative schemes, whose semi-convergent behavior requires early stopping and often leads to noise amplification and reconstruction artifacts. In this work, we introduce a self-tuning explicit regularization framework for both MID and s$^2$ISM reconstruction. Within a Bayesian maximum a posteriori formulation, we combine a multi-frame Poisson data fidelity term with explicit regularization, considering $\ell_1$ and smoothed total variation penalties as representative examples. We further develop an automatic and ground-truth-free strategy for regularization parameter selection by adapting the residual whiteness principle to the multi-frame Poisson setting and introducing a spectral high-pass extension tailored to s$^2$ISM. The resulting framework enables stable reconstructions without empirical stopping rules. To demonstrate the proposed framework, we consider first-order optimization schemes based on proximal gradient and mirror descent methods with adaptive backtracking strategies. Experiments on simulated and real fluorescence ISM datasets demonstrate improved reconstruction stability and image quality with respect to unregularized approaches, while enabling robust super-resolution and optical sectioning in low-photon conditions.</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>A Novel Computer Vision Approach for Assessing Fish Responses to Intrusive Objects in Aquaculture</title>
  <link>https://arxiv.org/abs/2605.30399</link>
  <pubDate>Mon, 01 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2605.30399v1 Announce Type: cross Abstract: The aquaculture industry needs to address several challenges to secure sustainable seafood production that can serve an increasing global demand. One major challenge is to ensure good fish health and acceptable welfare during production since the improvement of fish welfare is of vital importance in current and future production systems. In this study, this is addressed by developing and implementing methods to identify fish behaviors in response to intrusive objects both on individual and on a group basis. A novel approach for detecting, tracking, and estimating the 3D position of individual fish has thus been developed, and specifically designed to track the caudal fins of farmed fish in industrial sea cages. The tracking data was subjected to a novel stereo-vision method adapted to estimate fish positions, velocities, accelerations, and turning and pitch angles. Datasets obtained from industrial-scale fish farms were then analyzed to identify the impact of structures of varying shapes, sizes, and colors on fish behavior. The method was trained using manually labeled caudal fins, and used YOLOv8 with ByteTrack as an object detector and tracker, SuperGlue for matching detections in the left and right frames, and triangulation to reconstruct the 3D positions of the fish. Different image pre-processing and augmentation methods for enhancing object detection accuracy were tested and their performance compared, while RAFT-Stereo was tested for depth estimation purposes. The obtained results both validate the method&#39;s performance against previous research efforts, and demonstrate the novelty and potential of this method in providing more insight into behavioral dynamics in sea-cages.</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>Position-Blind Ptychography: Viability of image reconstruction via data-driven variational inference</title>
  <link>https://arxiv.org/abs/2509.25269</link>
  <pubDate>Mon, 01 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2509.25269v3 Announce Type: replace Abstract: In this work, we present and investigate the novel blind inverse problem of position-blind ptychography, i.e., ptychographic phase retrieval without any knowledge of scan positions, which then must be recovered jointly with the image. The motivation for this problem comes from single-particle diffractive X-ray imaging, where particles in random orientations are illuminated and a set of diffraction patterns is collected. If one uses a highly focused X-ray beam, the measurements would also become sensitive to the beam positions relative to each particle and therefore ptychographic, but these positions are also unknown. We investigate the viability of image reconstruction in a simulated, simplified 2-D variant of this difficult problem, using variational inference with modern data-driven image priors in the form of score-based diffusion models. We find that, with the right illumination structure and a strong prior, one can achieve reliable and successful image reconstructions even under measurement noise, in all except the most difficult evaluated imaging scenario.</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>A Survey on Semantic Communication for Vision: Categories, Frameworks, Enabling Techniques, and Applications</title>
  <link>https://arxiv.org/abs/2601.22202</link>
  <pubDate>Mon, 01 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2601.22202v2 Announce Type: replace Abstract: Semantic communication (SemCom) emerges as a transformative paradigm for traffic-intensive visual data transmission, shifting focus from raw data to meaningful content transmission and relieving the increasing pressure on communication resources. However, to achieve SemCom, challenges are faced in accurate semantic quantization for visual data, robust semantic extraction and reconstruction under diverse tasks and goals, transceiver coordination with effective knowledge utilization, and adaptation to unpredictable wireless communication environments. In this paper, we present a systematic review of SemCom for visual data transmission (SemCom-Vision), wherein an interdisciplinary analysis integrating computer vision (CV) and communication engineering is conducted to provide comprehensive guidelines for the machine learning (ML)-empowered SemCom-Vision design. Specifically, this survey first elucidates the basics and key concepts of SemCom. Then, we introduce a novel classification perspective to categorize existing SemCom-Vision approaches as semantic preservation communication (SPC), semantic expansion communication (SEC), and semantic refinement communication (SRC) based on communication goals interpreted through semantic quantization schemes. Moreover, this survey articulates the ML-based encoder-decoder models and training algorithms for each SemCom-Vision category, followed by knowledge structure and utilization strategies. Finally, we discuss potential SemCom-Vision applications.</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>Rectified flow-based prediction of post-treatment brain MRI from pre-radiotherapy priors for patients with glioma</title>
  <link>https://arxiv.org/abs/2603.08385</link>
  <pubDate>Mon, 01 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2603.08385v2 Announce Type: replace Abstract: Brain tumors result in 20 years of lost life on average. Standard therapies induce complex structural changes in the brain that are monitored through MRI. Recent developments in artificial intelligence (AI) enable conditional multimodal image generation from clinical data. In this study, we investigate AI-driven generation of follow-up MRI in patients with intracranial tumors through conditional image generation. This approach enables realistic modeling of post-radiotherapy changes, allowing for treatment optimization. The public SAILOR dataset of 25 patients was used to create a 2D rectified flow model conditioned on axial slices of pre-treatment MRI and RT dose maps. Cross-attention conditioning was used to incorporate temporal and chemotherapy data. The resulting images were validated with structural similarity index measure (SSIM), peak signal-to-noise ratio (PSNR), Dice scores and Jacobian determinants. The resulting model generates realistic follow-up MRI for any time point, while integrating treatment information. Comparing real versus predicted images, SSIM is 0.88, and PSNR is 22.82. Tissue segmentations from real versus predicted MRI result in a mean Dice-Sørensen coefficient (DSC) of 0.91. The rectified flow (RF) model enables up to 250x faster inference than Denoising Diffusion Probabilistic Models (DDPM). The proposed model generates realistic follow-up MRI in real-time, preserving both semantic and visual fidelity as confirmed by image quality metrics and tissue segmentations. Conditional generation allows counterfactual simulations by varying treatment parameters, producing predicted morphological changes. This capability has potential to support adaptive treatment dose planning and personalized outcome prediction for patients with intracranial tumors. Code will be available upon peer-reviewed publication at: https://github.com/SelenaIHuisman/RF-GlioPREDICT</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>Absorption and Phase-Contrast Microtomography Using Direct X-ray Detection With COTS CMOS Sensors</title>
  <link>https://arxiv.org/abs/2605.29808</link>
  <pubDate>Mon, 01 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2605.29808v2 Announce Type: replace Abstract: This work presents a high-resolution X-ray microtomography system that uses commercial off-the-shelf (COTS) CMOS image sensors as direct detectors, relying on the sensor&#39;s intrinsic resolution to achieve tomographic reconstructions without optical components. The system employs a microfocus X-ray source in cone-beam geometry, enabling both absorption-contrast and propagation-based phase-contrast imaging. A dynamic flat-field correction algorithm mitigates radiation-induced degradation during long acquisitions, helping to overcome limitations of consumer-grade hardware. The setup provides voxel sizes from 3.9 micron to 5.2 micron. Phase contrast visualizes soft tissue boundaries that would be undetectable by conventional radiography. Compared to synchrotron or nanofocus systems, our solution is simpler, lower-cost, and avoids complex optics or slow scans. COTS CMOS sensors appear as a viable alternative for laboratory-scale high-resolution microtomography.</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>SDF2Net: Shallow to Deep Feature Fusion Network for PolSAR Image Classification</title>
  <link>https://arxiv.org/abs/2402.17672</link>
  <pubDate>Mon, 01 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2402.17672v2 Announce Type: replace-cross Abstract: Polarimetric synthetic aperture radar (PolSAR) images encompass valuable information that can facilitate extensive land cover interpretation and generate diverse output products. Extracting meaningful features from PolSAR data poses challenges distinct from those encountered in optical imagery. Deep learning (DL) methods offer effective solutions for overcoming these challenges in PolSAR feature extraction. Convolutional neural networks (CNNs) play a crucial role in capturing PolSAR image characteristics by leveraging kernel capabilities to consider local information and the complex-valued nature of PolSAR data. In this study, a novel three-branch fusion of complex-valued CNN, named the Shallow to Deep Feature Fusion Network (SDF2Net), is proposed for PolSAR image classification. To validate the performance of the proposed method, classification results are compared against multiple state-of-the-art approaches using the airborne synthetic aperture radar (AIRSAR) datasets of Flevoland and San Francisco, as well as the ESAR Oberpfaffenhofen dataset. The results indicate that the proposed approach demonstrates improvements in overall accuracy, with a 1.3% and 0.8% enhancement for the AIRSAR datasets and a 0.5% improvement for the ESAR dataset. Analyses conducted on the Flevoland data underscore the effectiveness of the SDF2Net model, revealing a promising overall accuracy of 96.01% even with only a 1% sampling ratio.</description>
  <dc:source>Systems/eess.IV_(Image_and_Video_Processing)</dc:source>
</item>
<item>
  <title>Low-cost IoT-Based Rainfall Monitoring with Web-Based Data Access</title>
  <link>https://arxiv.org/abs/2605.30528</link>
  <pubDate>Mon, 01 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2605.30528v1 Announce Type: new Abstract: Rainfall measurement with high spatial and temporal resolution is critical for flood forecasting, drought mitigation, and disaster preparedness. Rainfall patterns are highly variable, both geographically and over time. This variability presents a significant challenge for monitoring, as rain gauges can accurately capture temporal patterns only at a single location. Furthermore, the high cost of commercial instruments restricts their widespread deployment, and rain gauge networks often fail to adequately capture the spatial heterogeneity of precipitation patterns. To address these limitations, this study introduces a low-cost IoT-based rainfall monitoring system developed upon the Low-cost Efficient Wireless Intelligent Sensor (LEWIS) platform. Four rainfall sensors were designed, developed, and deployed at different locations across the semi-arid region of the United States, in the State of New Mexico, to capture localized precipitation variability. Each sensor node integrates a rainfall detection module with an LTE-enabled microcontroller and is powered by a compact solar-battery system, ensuring autonomous and self-sufficient operation. Real-time precipitation data are transmitted to a cloud server for continuous access, visualization, and integration with early-warning frameworks. The results demonstrate that IoT-based rainfall monitoring can achieve reliable accuracy at a fraction of the cost of conventional gauges, while supporting dense deployment for microscale precipitation analysis. Comparative validation with model-based precipitation data and in situ observations shows strong agreement in the detection and timing of recorded precipitation events, highlighting the system&#39;s potential for early warning, disaster risk reduction, and bias correction of remotely sensed precipitation products by filling observational gaps in under-instrumented semi-arid areas.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>On Spatial Degree-of-Freedom Analysis of Near-Field Multipath Channels for Ultra-massive MIMO Systems</title>
  <link>https://arxiv.org/abs/2605.30787</link>
  <pubDate>Mon, 01 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2605.30787v1 Announce Type: new Abstract: The transition to near-field (NF) communications in ultra-massive multiple-input multiple-output (UM-MIMO) systems fundamentally alters the spatial degrees of freedom (DoF) of wireless channels. While the NF DoF of line-of-sight (LoS) transmission channels is well-characterized in the literature, the DoF in NF multipath scenarios remains underexplored. This paper investigates the spatial DoF of NF UM-MIMO channels under practical multipath conditions. A generic DoF metric is derived by modeling multipath propagation and analyzing the resulting eigenvalue distribution based on the Green&#39;s function representation of the channel. The DoF contribution of each path is determined by the product of the effective electrical aperture and the subtended solid angle, and the total DoF is obtained through the effective union of spatially resolvable path contributions. A mapping between the eigenvalue distribution and multipath powers is further established. Numerical simulations and real-world NF channel measurements at 28-30 GHz with 720 array elements are conducted for validation in both LoS multipath and non-LoS scenarios. The results show that multipath propagation can significantly increase the spatial DoF and that the proposed metric accurately predicts the DoF of practical NF channels. The proposed framework provides a practical tool for DoF prediction and supports capacity analysis and spatial multiplexing design in future NF UM-MIMO systems.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Distribution-Aware Constellation Learning for Image Transmission</title>
  <link>https://arxiv.org/abs/2605.30988</link>
  <pubDate>Mon, 01 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2605.30988v1 Announce Type: new Abstract: Semantic communication has demonstrated significant potential for image transmission, especially in bandwidth-limited and low signal-to-noise ratio scenarios. However, most existing methods are based on analog transmission, which poses challenges to the compatibility with existing digital communication systems. Existing digital semantic communication methods commonly adopt conventional quadrature amplitude modulation constellations, which mismatch the empirical distribution of semantic features produced by the semantic encoder. This paper proposes a distribution-aware learnable modulation framework for semantic communication, which bridges semantic feature representations and discrete modulation through constellation learning. Specifically, a learnable constellation module, initialized with an amplitude phase shift keying geometric prior, is developed to refine the constellation geometry as a trainable codebook, enabling modulation symbols to better align with the distribution of semantic features. To enable end-to-end optimization, a two-stage training strategy is introduced, combining differentiable soft assignment with a straight-through estimator. Simulation results show that the proposed framework consistently outperforms existing digital semantic communication schemes and achieves performance comparable to advanced analog methods.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Combining Cartesian and non-Cartesian acceleration techniques with SPARKLING for 1mm isotropic whole-brain MPRAGE in a minute</title>
  <link>https://arxiv.org/abs/2605.31017</link>
  <pubDate>Mon, 01 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2605.31017v1 Announce Type: new Abstract: Purpose: T1-weighted MPRAGE remains a cornerstone of clinical anatomical imaging, yet its long acquisition times constrain routine use. Established acceleration techniques, namely Parallel Imaging (PI) and Compressed Sensing (CS), tend to introduce substantial noise and blurring when pushed to high acceleration factors. Although they rely on fundamentally different redundancies, combining them synergistically remains an open challenge. Methods: The GoLF-SPARKLING framework was extended to jointly exploit two acceleration mechanisms: GRAPPA-based PI in the central k-space region and variable-density CS in the periphery, with independent acceleration factors in each zone. To preserve smooth signal evolution throughout the inversion-recovery period and avoid modulation artifacts, the acquisition trajectory was reordered accordingly. The resulting method was evaluated prospectively in vivo at 1mm isotropic resolution and benchmarked against Wave-CAIPI and Poisson-disk sampling. Results: The proposed hybrid approach produced sharper, less noisy, and more stable whole-brain images in approximately one minute than either acceleration strategy alone. Purely PI-based reconstructions were degraded by high g-factor noise, while purely CS-based reconstructions exhibited pronounced blurring. Furthermore, this method yielded lower average volumetric errors in downstream automated brain segmentation than state-of-the-art acceleration techniques, demonstrating its clinical utility. Conclusion: By jointly leveraging PI and CS, GoLF-SPARKLING achieves high acceleration factors that enable sub-minute, high-quality anatomical MRI. This translates into greater clinical throughput and more reliable imaging in patients who are challenging to scan.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>CRB-Optimal Arrays and Waveforms in Active Sensing: Role of Redundancy and Spatial Covariance of Array Geometry</title>
  <link>https://arxiv.org/abs/2605.31059</link>
  <pubDate>Mon, 01 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2605.31059v1 Announce Type: new Abstract: This paper characterizes the performance limits of optimal array designs using orthogonal and coherent waveforms for both linear and planar arrays. For orthogonal waveforms, we show that the single-target Cramér-Rao Bound (CRB) depends on the sum of the so-called spatial variances of the transmit (Tx) and receive (Rx) arrays, or equivalently, the spatial variance of the sum co-array weighted by the multiplicities of the virtual sensors. This reveals that CRB-optimal geometries are inherently redundant, highlighting a fundamental trade-off between mean squared error (MSE) and identifiability in parameter estimation. Moreover, we derive optimal Tx-Rx sensor allocations given a total sensor budget and show that unequal allocation (favoring the Rx) is optimal even for nonredundant arrays, questioning conventional designs. We extend our results to planar arrays, providing a new general condition that the spatial covariances of the Tx and Rx arrays should satisfy for the optimal waveforms to direct power in the target direction. Additionally, we establish a connection between Diophantine equations and array geometries with equal CRB, along with a constructive method for designing such arrays. Our work provides new guidelines for and insights into optimal array and waveform design with relevance in emerging active sensing multiple-input multiple-output systems.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>DRIFT: Joint Channel Estimation and Prediction Towards Pilotless 6G Non-Terrestrial Networks</title>
  <link>https://arxiv.org/abs/2605.31065</link>
  <pubDate>Mon, 01 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2605.31065v1 Announce Type: new Abstract: Non-terrestrial networks (NTNs) are expected to play a pivotal role in sixth-generation (6G) systems by enabling ubiquitous connectivity and massive communication. In this context, channel prediction emerges as a key technique to improve the spectrum utilization efficiency by limiting the pilot overhead. However, many proposed predictors based on artificial intelligence (AI) are characterized by high inference complexity, posing challenges to onboard implementation. In this paper, we address the challenge of designing accurate yet computationally efficient channel prediction techniques tailored to low Earth orbit (LEO) NTNs, where strict power constraints limit model complexity, to enable spectral efficiency gains. We propose an iterative joint channel estimation and prediction framework in the context of 6G NTNs that significantly reduces pilot overhead by transmitting pilots only in the initial slot and relying on data-driven processing for subsequent slots. We introduce Data-driven Refinement and Iterative Forecast for wireless channel Tracking (DRIFT), a lightweight architecture that refines data-aided channel estimates and predicts future channel frequency responses with low computational cost and reduced error propagation. Two predictor variants based on convolutional and long short-term memory layers are investigated. Simulation results in an end-to-end simulation of an uplink LEO NTN scenario show that the proposed approach achieves up to 12% spectral efficiency gain compared to conventional pilot-based systems, with robustness to training-test mismatches and consistent performance across different channel models. Moreover, DRIFT requires fewer than 200k multiply-accumulate operations, making it suitable for on-board satellite implementation under stringent power constraints.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Impact of Phase Errors on Distributed NTN Beam Focusing</title>
  <link>https://arxiv.org/abs/2605.31180</link>
  <pubDate>Mon, 01 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2605.31180v1 Announce Type: new Abstract: This paper investigates distributed beam focusing for coordinated satellite constellations with phased arrays, motivated by future non-terrestrial network (NTN) systems. A geometric and channel model is developed by incorporating satellite positions, array orientations, antenna directivity, and polarization effects. Under ideal synchronization, the achievable coherent combining gain is analyzed for different constellation geometries, showing that maximum ratio transmission (MRT) enables quadratic scaling of the received power with the number of satellites. The impact of phase errors caused by residual synchronization, timing, mobility, and localization mismatches is then investigated. Closed-form expressions for the average coherent gain are derived for uniformly distributed timing offsets, demonstrating the transition from coherent to non-coherent combining. The results show that synchronization and timing mismatches reduce the coherent combining gain, while geometry dependent effects govern the resulting spatial focusing behavior. Numerical results further show that linear and circular constellations provide different focusing characteristics and spatial separation capabilities. However, MRT-based focusing results in strong sidelobes and limited spatial division capability, motivating the need for joint analog beamforming and digital precoding optimization to improve spatial selectivity and robustness against mobility and localization errors.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Super-Resolution Experimental Validation and Polarimetric Extension of the Effective Roughness Diffuse Scattering Models</title>
  <link>https://arxiv.org/abs/2605.31267</link>
  <pubDate>Mon, 01 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2605.31267v1 Announce Type: new Abstract: The experimental validation of diffuse scattering models has long been limited by the inability to spatially separate specular and diffuse contributions in measured channels. This paper overcomes this limitation by combining super-resolution multipath component (MPC) extraction, which resolves individual propagation paths including the specular component, with digital-twin-assisted geometry, enabling the spatial separation of specular and diffuse contributions from bistatic measurements at 28 GHz. Using this framework, we provide the first measurement-driven validation of the Effective Roughness (ER) model with independent characterization of diffuse scattering across ten common building materials, each measured over 266 angular configurations and all polarization combinations (HH, HV, VH, VV). Furthermore, we extend the ER framework by proposing a novel angle-dependent cross-polarization discrimination (XPD) model, capturing the geometry-dependent nature of depolarization that is neglected in existing approaches. The proposed method reproduces the measured diffuse power trends, achieving RMSE values as low as 3 dB across the tested materials, and improves XPD prediction over the baseline constant-XPD model for nearly all material-polarization cases. These results establish a physically consistent and practically viable approach for high-fidelity channel modeling in mmWave systems.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Practical Cross-Band Channel Prediction for AI-RAN via Physics-Guided Deep Unfolding</title>
  <link>https://arxiv.org/abs/2605.31279</link>
  <pubDate>Mon, 01 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2605.31279v1 Announce Type: new Abstract: To make cross-band channel prediction practical for AI-native RAN, algorithms must generalize across diverse environments and support real-time inference. Existing approaches achieve one but not both. To bridge this gap, we introduce GUIDE, a physics-guided deep unfolding framework that embeds wireless channel physics into differentiable layers. Without retraining in unseen environments, GUIDE achieves 2.75x the beamforming gain of the deep learning-based baseline FIRE with only a slight increase in inference time, and 1.39x the beamforming gain of the strongest model-based baseline R2F2 while running over 1610x faster.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>ISAC-Enabled Grant-Free Uplink via Artificial-Path Delay Modulation</title>
  <link>https://arxiv.org/abs/2605.31366</link>
  <pubDate>Mon, 01 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2605.31366v1 Announce Type: new Abstract: This paper proposes an integrated sensing and communication (ISAC)-enabled grant-free uplink framework based on artificial-path delay modulation. A grant-free user equipment (g-UE) conveys uplink information by modulating the delay of a controllable artificial path derived from the scheduled downlink waveform. In contrast to conventional superposition-based schemes with successive interference cancellation, the proposed method enables uplink-downlink coexistence in the delay-sensing domain. By introducing a single weak artificial path confined within the cyclic prefix (CP), the g-UE allows the access point (AP) to decode uplink symbols from CSI perturbations while causing only limited degradation to the scheduled user equipment (s-UE) in the downlink. To support reliable finite-alphabet delay detection under unknown path gain and off-grid leakage, we develop a baseline delay calibration procedure and a normalized matched-filter detector. Results show that reflection power determines the reliability trade-off between the g-UE and the s-UE, whereas the delay step mainly controls the g-UE reliability-efficiency trade-off with little additional impact on the downlink s-UE. Even with an artificial path 15 dB weaker than the scheduled downlink signal, the g-UE achieves lower BER than the s-UE at an effective modulation order of 16-QAM. The proposed framework thus offers a low-complexity, SIC-free, and downlink-friendly solution for grant-free uplink in ISAC systems.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Perceptual-Quality based AMC for Enhanced mmWave Spectral Efficiency: Concept and Experiment</title>
  <link>https://arxiv.org/abs/2605.31499</link>
  <pubDate>Mon, 01 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2605.31499v1 Announce Type: new Abstract: For high-throughput applications such as ultra-high-definition video streaming and immersive extended-reality, perceptual quality rather than bit-level accuracy defines the primary performance criterion and provides a more informative and spectrally efficient objective than strict bitwise reconstruction. This is particularly relevant in millimeter-wave (mmWave) and sub-Terahertz (sub-THz) systems, where path loss, short channel coherence times and phase noise introduce severe fluctuations that degrade link spectral efficiency. We propose an extension to the conventional Adaptive Modulation and Coding (AMC) framework that incorporates perceptual quality awareness into link adaptation. In this framework, the decision metric is a Perceptual Quality Indicator (PQI) derived from the Structural Similarity Index Measure (SSIM). The receiver employs a Denoising Convolutional Neural Network (DnCNN) denoiser to enhance post-decoding image quality before feedback estimation. The resulting perceptual metric replaces the standard Channel Quality Indicator (CQI) in the AMC loop, enabling adaptation to maximize spectral efficiency while satisfying a perceptual-fidelity constraint. Experiments on a 5G-compliant mmWave testbed demonstrate up to a twofold gain in spectral efficiency while maintaining perceptual fidelity, underscoring the potential of perception-optimized link adaptation.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Cooperative Uplink Channel Estimation in User-Centric Cell-free Massive MIMO Communication Networks</title>
  <link>https://arxiv.org/abs/2605.31510</link>
  <pubDate>Mon, 01 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2605.31510v1 Announce Type: new Abstract: Cell-free massive multi-input-multi-output (CFmMIMO) communication networks aim to provide uniform quality of service by distributing access points (APs) across a coverage area. In user-centric variants, each user equipment (UE) can choose a cluster of APs with the best channel conditions (e.g., the closest APs) for accessing service. This approach eliminates the notion of cells with dedicated regions and APs, as found in cellular mMIMO communication networks. Estimating uplink channels between UEs and APs is a crucial step in CFmMIMO communication networks; however, existing channel estimation (CE) approaches typically originate from mMIMO systems without considering the unique properties of CFmMIMO communication networks. For instance, shorter AP-UE distances in CFmMIMO systems result in Rician channel models with prominent line of sight (LoS) components between APs and UEs, motivating cooperation between APs for improved performance. In this paper, we propose a cooperative minimum-mean-squared-error (MMSE)-based uplink CE approach where APs share their linearly compressed signals as fused signals with other APs in the same cluster. The proposed approach is optimal, i.e., its performance is equivalent to that of the centralized CE approach, where APs share their uncompressed raw signals. Notably, this optimality is achieved in one shot; that is, given the required correlation matrices, the optimal fusion filters and estimators are derived non-iteratively. Consequently, the proposed approach guarantees lower communication overhead for cooperative CE compared to the centralized approach. Numerical experiments corroborate the superior performance of the proposed cooperative CE approaches in terms of CE accuracy and convergence rate.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Functional MRI Time Series Generation via Wavelet-Based Image Transform and Spectral Flow Matching for Brain Disorder Identification</title>
  <link>https://arxiv.org/abs/2605.30387</link>
  <pubDate>Mon, 01 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2605.30387v1 Announce Type: cross Abstract: Functional Magnetic Resonance Imaging (fMRI) provides non-invasive access to dynamic brain activity by measuring blood oxygen level-dependent (BOLD) signals over time. However, the resource-intensive nature of fMRI acquisition limits the availability of high-fidelity samples required for data-driven brain analysis models. While modern generative models can synthesize fMRI data, they often struggle to replicate the inherent non-stationarity, intricate spatiotemporal dynamics, and physiological variations of raw BOLD signals. To address these challenges, we propose Dual-Spectral Flow Matching (DSFM), a novel fMRI generative framework that cascades a dual frequency representation of BOLD signals with spectral flow matching. Specifically, our framework first converts BOLD signals into a wavelet decomposition map via a discrete wavelet transform (DWT) to capture globalized transient and multi-scale variations, and then projects this map into the discrete cosine transform (DCT) space across brain regions and time to exploit localized energy compaction of low-frequency dominant BOLD coefficients. Subsequently, a spectral flow matching model is trained to generate a class-conditioned cosine-frequency representation. The generated samples are reconstructed through inverse DCT and inverse DWT operations to recover physiologically plausible time-domain BOLD signals. This dual-transform approach imposes structured frequency priors and preserves key physiological brain dynamics. Ultimately, we demonstrate the efficacy of our approach through improved downstream fMRI-based brain network classification. The code is available at https://github.com/htew0001/DSFM.git .</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Distributionally Robust Physical-Layer Security for Satellite Communication via Aerial Reconfigurable Intelligent Surface</title>
  <link>https://arxiv.org/abs/2605.31526</link>
  <pubDate>Mon, 01 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2605.31526v1 Announce Type: cross Abstract: Satellite communications are envisioned as a key enabler for ubiquitous coverage in future 6G networks, yet the broadcast nature renders them vulnerable to eavesdropping, especially given the long-distance transmissions and associated high uncertainties. In this paper, we propose the physical layer security enhancement for multi-beam satellite communications with the assistance of an aerial reconfigurable intelligent surface (ARIS). Considering the high dynamics and uncertainties of channels, we characterize the channel distribution with moment-based ambiguity sets. Accordingly, a distributionally robust secrecy rate optimization is formulated through joint design of transmit and reflection beamforming. We then introduce a conditional value-at-risk-based reformulation to convert the probabilistic constraints into deterministic forms. An alternating optimization framework is subsequently employed to iteratively update the transmit and reflective beamforming vectors until convergence. Simulation results demonstrate that the proposed distributionally robust scheme significantly enhances secrecy performance, and maintains reliable performance across various channel error distributions.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Microwave Linear Analog Computer (MiLAC) for Simultaneous Active and Passive Beamforming</title>
  <link>https://arxiv.org/abs/2605.31549</link>
  <pubDate>Mon, 01 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2605.31549v1 Announce Type: cross Abstract: Microwave linear analog computers (MiLACs) have recently emerged to enable high-performance and efficient beamforming in the analog domain. In this paper, we introduce a dual-functionality framework for MiLAC-aided transceivers. Beyond analog-domain precoding/combining (active beamforming), a MiLAC and its antenna array can simultaneously act as a reconfigurable intelligent surface (RIS) (passive beamforming). This allows the MiLAC to execute beamforming for transmission/reception while reflecting external incident signals. We provide an optimal reconfiguration strategy for this dual-functional MiLAC, and characterize the fundamental limits on the trade-off between active and passive rate, namely the capacity region bounds and the sum-rate capacity.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Near-Field Position and Orientation Tracking With Hybrid ELAA Architecture</title>
  <link>https://arxiv.org/abs/2512.17274</link>
  <pubDate>Mon, 01 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2512.17274v2 Announce Type: replace Abstract: This paper investigates near-field (NF) position and orientation tracking of a multi-antenna mobile station (MS) using an extremely large antenna array (ELAA)-equipped base station (BS) with a limited number of radio frequency (RF) chains. Under this hybrid array architecture, the received uplink pilot signal at the BS is first combined by analog phase shifters, producing a low-dimensional observation before digital processing. Such analog compression provides only partial access to the ELAA measurement, making it essential to design an analog combiner that can preserve pose-relevant signal components despite channel uncertainty and unit-modulus hardware constraints. To address this, we propose a predictive analog combining-assisted extended Kalman filter (PAC-EKF) framework, where the analog combiner can leverage the temporal correlation in the MS pose variation to capture the most informative signal components predictively. We then analyze fundamental performance limits via Bayesian Cramér-Rao bound and Fisher information matrix, explicitly quantifying how the analog combiner, array size, signal-to-noise ratio, and MS pose influence the pose information contained in the uplink observation. Building on these insights, we develop two methods for designing a low-complexity analog combiner. Numerical results show that the proposed predictive analog combining approach significantly improves tracking accuracy, even with fewer RF chains and lower transmit power.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Near-Field Multi-User Communications via Polar-Domain Beamfocusing: Analytical Framework and Performance Analysis</title>
  <link>https://arxiv.org/abs/2512.17283</link>
  <pubDate>Mon, 01 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2512.17283v2 Announce Type: replace Abstract: As wireless systems evolve toward higher frequencies and extremely large antenna arrays, near-field (NF) propagation becomes increasingly dominant. Unlike far-field (FF) communication, which relies on a planar-wavefront model and is limited to angular-domain beamsteering, NF propagation exhibits spherical wavefronts that enable beamfocusing in both angle and distance, i.e., the polar domain, offering new opportunities for spatial multiple access. This paper develops an analytical stochastic geometry (SG) framework for a multi-user system assisted by polar-domain beamfocusing, which jointly captures NF propagation characteristics and the spatial randomness of user locations. The intrinsic coupling between angle and distance in the NF antenna pattern renders inter-user interference analysis intractable. To address this challenge, we propose a tractable near-field multi-level antenna pattern (NF-MLAP) approximation, which enables computationally efficient expressions and tight upper bounds for key performance metrics, including coverage probability, spectrum efficiency, and area spectrum efficiency. Analytical and simulation results demonstrate that the proposed framework accurately captures performance trends and reveals fundamental trade-offs between hardware configuration (including the number of antennas and radio frequency chains) and system performance (in terms of spatial resource reuse and interference mitigation).</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>ML-Assisted Bulk Resource Allocation: Custom Outage-Based Loss Function and Reliability Analysis</title>
  <link>https://arxiv.org/abs/2603.00712</link>
  <pubDate>Mon, 01 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2603.00712v2 Announce Type: replace Abstract: Machine learning (ML)-assisted outage-based resource allocation has recently emerged as an effective alternative to conventional scheduling methods in reliability-critical wireless systems. However, existing approaches are fundamentally limited to single-resource allocation, whereas modern and emerging systems increasingly require the simultaneous allocation of multiple resources to meet aggregate rate and reliability constraints. In this paper, we extend outage-based learning to the bulk resource allocation regime, where a user requires at least $D$ reliable resources from a pool of $R$ candidates. We first introduce a practical allocation policy, termed gate + top-$D$ allocation (GTBA), which combines threshold-based admission control with ranking-based selection. We then propose a novel ranking-aware bulk outage loss (RBOL) that provides a differentiable surrogate for the bulk outage event induced by GTBA, explicitly accounting for both gate failures and ranking errors near the selection boundary. An exact reliability analysis is developed, establishing a decomposition of bulk outage probability (BOP), identifying dominant failure mechanisms and deriving an oracle lower bound that characterizes the fundamental performance limit. Extensive simulations under balanced, light and heavy stress regimes demonstrate that RBOL consistently outperforms conventional pointwise losses and baselines, achieving substantial reductions in BOP and remaining significantly closer to the oracle bound across a wide range of operating conditions. These results confirm that set-level ranking-aware training objectives are essential for reliable ML-assisted bulk resource allocation.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Enhanced Direction-Sensing Methods and Performance Analysis in Low-Altitude Wireless Network via a Rotating Antenna Array</title>
  <link>https://arxiv.org/abs/2603.20784</link>
  <pubDate>Mon, 01 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2603.20784v5 Announce Type: replace Abstract: Due to the directive property of each antenna element, the received signal power can be severely attenuated when the emitter deviates from the array boresight, which leads to a severe degradation in sensing performance along the corresponding direction. Although existing rotatable array sensing methods such as recursive rotation (RR-Root-MUSIC) can mitigate this issue by iteratively rotating and sensing, they require several mechanical rotations and repeated eigendecomposition operations, resulting in high computational complexity and low time-efficiency. To address this problem, a pre-rotation initialization that uses receive power as its criterion is proposed to significantly reduce the computational complexity and improve the time-efficiency. Using this idea, a low-complexity enhanced direction-sensing framework with pre-rotation initialization and iterative greedy spatial-spectrum search (PRI-IGSS) is developed with three stages: (1) the normal vector of the array is rotated to a set of candidates to find the optimal direction with the maximum sensing energy, with the corresponding DOA value computed by the Root-MUSIC algorithm; (2) the array is mechanically rotated to the initial estimated direction and kept fixed; (3) an iterative greedy spatial-spectrum search or receive beamforming method, motivated by reinforcement learning, is designed with a reduced search range, and a summation of all previous sampling variance matrices and the current one is adopted to provide an increasing performance gain as the iteration process continues. To assess the performance of the proposed method, the corresponding CRLB is derived with a simplified rotation model. Simulation results demonstrate that the proposed PRI-IGSS method performs much better than RR-Root-MUSIC and achieves the CRLB in terms of mean squared error, because the latter performs no sample accumulation.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Array Zooming Optimization for Near-Field Localization With Movable Antennas</title>
  <link>https://arxiv.org/abs/2604.27352</link>
  <pubDate>Mon, 01 Jun 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2604.27352v2 Announce Type: replace Abstract: The emergence of movable antenna (MA) technology provides a promising way to enhance wireless sensing and communication by introducing spatial degrees of freedom through dynamic array reconfiguration. In near-field localization, achieving high resolution at low cost necessitates the adoption of sparse arrays. However, such sparsity tends to introduce spatial ambiguity due to aliasing effects. To resolve this resolution-ambiguity dilemma, this paper proposes an MA-enabled array zooming (AZ) system. First, we design a multi-measurement array zooming system that dynamically adjusts antenna spacings. By fusing the observational information from different measurements, the proposed AZ system effectively mitigates spatial aliasing while maintaining spatial resolution. Second, to quantify the performance limits under the severe multi-modal distributions inherent in sparse near-field sensing, we theoretically analyze the false peak distribution and derive a tighter performance lower bound, which incorporates the false detection probability. Third, considering that multiple false peaks may exist in practical multi-modal distributions, we propose an optimization algorithm for the AZ system to suppress false peaks and minimize the localization error. Extensive numerical results demonstrate that the proposed AZ strategy adaptively optimizes array configurations under varying signal-to-noise ratios (SNRs), substantially outperforming both conventional fixed-spacing arrays and Cramer-Rao bound (CRB)-based AZ benchmarks in localization accuracy.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Scensory: Real-Time Robotic Olfactory Perception for Joint Identification and Source Localization</title>
  <link>https://arxiv.org/abs/2509.19318</link>
  <pubDate>Fri, 29 May 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2509.19318v3 Announce Type: replace Abstract: While robotic perception has advanced rapidly in vision and touch, enabling robots to reason about indoor fungal contamination from weak, diffusion-dominated chemical signals remains an open challenge. We introduce Scensory, a learning-based robotic olfaction framework that simultaneously identifies fungal species and localizes their source from short time series measured by affordable, cross-sensitive VOC sensor arrays. Temporal VOC dynamics encode both chemical and spatial signatures, which we decode through neural networks trained on robot-automated data collection with spatial supervision. Across five fungal species, Scensory achieves up to 89.85% species accuracy and 87.31% source localization accuracy under ambient conditions with 3-7s sensor inputs. These results demonstrate real-time, spatially grounded perception from diffusion-dominated chemical signals, enabling scalable and low-cost source localization for robotic indoor environmental monitoring.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Theoretical Validation of the Latent Optimally Partitioned-$\ell_2/\ell_1$ Penalty with Application to Angular Power Spectrum Estimation</title>
  <link>https://arxiv.org/abs/2509.13745</link>
  <pubDate>Fri, 29 May 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2509.13745v4 Announce Type: replace Abstract: This paper demonstrates that, in both theory and practice, the latent optimally partitioned (LOP)-$\ell_2/\ell_1$ penalty is effective for exploiting block-sparsity without knowledge of the concrete block structure. More precisely, we first present a novel theoretical result showing that the optimized block partition in the LOP-$\ell_2/\ell_1$ penalty satisfies a condition required for accurate recovery of block-sparse signals. Motivated by this result, we present a new application of the LOP-$\ell_2/\ell_1$ penalty to estimation of angular power spectrum, which is block-sparse with unknown block partition, in MIMO communication systems. Numerical simulations show that the proposed use of block-sparsity with the LOP-$\ell_2/\ell_1$ penalty significantly improves the estimation accuracy of the angular power spectrum.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Björck Sequences: Extension to Arbitrary Lengths, Correlation Analysis, and Applications to Wireless Systems</title>
  <link>https://arxiv.org/abs/2506.00706</link>
  <pubDate>Fri, 29 May 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2506.00706v2 Announce Type: replace Abstract: In this paper, we propose a sequence construction framework that extends prime-length Björck sequences, a class of Constant Amplitude Zero Autocorrelation (CAZAC) sequences, to arbitrary lengths using Goldbach&#39;s conjecture for even and odd integers. The framework is generic and applies to any CAZAC family defined for prime lengths and supports extensions to both cyclically shifted sequences and sequences with different root indices. We analytically characterize the resulting correlation behavior and show that the construction preserves orthogonality among cyclic shifts while maintaining favorable zero-lag cross-correlation across different root-index sequences. We further investigate Björck sequences as candidates for reference signals in next-generation wireless systems. Using the proposed framework, we extend Björck sequences to arbitrary lengths and evaluate their time- and frequency-offset estimation performance in terrestrial (TNs) and non-terrestrial networks (NTNs). Results show performance comparable to Zadoff-Chu (ZC) sequences in low-Doppler TN environments and improved robustness in high-Doppler NTN scenarios due to superior ambiguity-function properties. We also identify an inherent Doppler-dependent behavior that can cause sequence misidentification under large Doppler shifts. To address this, we propose two mitigation strategies: (i) leveraging coarse Doppler estimates prior to detection, and (ii) selecting appropriately spaced subsets of orthogonal sequences. Ambiguity function-based analysis demonstrates the effectiveness of these approaches in improving estimation reliability. Overall, this work enables practical arbitrary-length CAZAC sequence design and establishes Björck sequences as a strong alternative for reference signal design in high-Doppler environments.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>A Lumped-Element Electrical Model of the Human Head for Brain-Oriented Applications</title>
  <link>https://arxiv.org/abs/2605.30172</link>
  <pubDate>Fri, 29 May 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2605.30172v1 Announce Type: cross Abstract: In this work, we present a compact surrogate circuit for electro-quasi-static (EQS) head modeling. A three-shell geometry (brain, skull, scalp) is considered, and each layer is modeled through radial and tangential pathways, implemented as RC branches. Frequency-dependent tissue conductivity and permittivity are mapped into dispersive resistive and capacitive elements. The model is validated against a semi-analytical spherical-harmonics reference solution over multiple geometrical configurations and operating frequencies, demonstrating good agreement. Neglecting dispersion and capacitive pathways can lead to an overestimation of scalp potentials over the considered frequency range, highlighting the need for dispersive RC circuit modeling.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>REACT: A Conditioning Framework for User-Adaptive sEMG Hand Pose Estimation</title>
  <link>https://arxiv.org/abs/2605.30127</link>
  <pubDate>Fri, 29 May 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2605.30127v1 Announce Type: cross Abstract: Surface electromyography (sEMG) enables continuous hand pose estimation on wearable devices, but models trained on multi-user corpora degrade on unseen individuals due to inter-user variability in anatomy and electrode placement. We propose REACT, a lightweight conditioning framework that personalizes a frozen pretrained EMG-to-pose backbone at inference time using only a handful of calibration recordings. REACT learns a compact user embedding from calibration data and applies Feature-wise Linear Modulation (FiLM) to adapt the shared encoder&#39;s feature space, requiring no gradient updates at deployment. On the large-scale EMG2POSE benchmark, REACT improves over the state-of-the-art baseline across all three generalization splits in both regression and tracking modes, reducing angular error by up to 3.9% with minimal parameter overhead and under 45 seconds of per-user calibration.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Low-Overhead Receiver Design for Data-Dependent Superimposed Training via Deep Learning</title>
  <link>https://arxiv.org/abs/2605.29995</link>
  <pubDate>Fri, 29 May 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2605.29995v1 Announce Type: cross Abstract: Superimposed pilot (SIP) transmission improves spectral efficiency by eliminating the dedicated pilot overhead required in orthogonal pilot (OP)-based schemes. However, SIP suffers from severe pilot-data coupling, which leads to a critical performance-complexity bottleneck at the receiver. To address this issue, this paper proposes a low-overhead transmission framework that revitalizes data-dependent superimposed training (DDST) with enhanced interference mitigation strategies. First, for quasi-static block-fading channels, an enhanced DDST receiver is developed to achieve non-iterative pilot-data decoupling by exploiting data-dependent algebraic structures. Second, to overcome the sensitivity of conventional DDST to channel variations and symbol misidentification in fast time-varying environments, a mix transmission scheme is developed. By strategically applying DDST to a subset of resource elements, the proposed scheme combines the interference-free transmission property of OP with the zero-pilot-overhead advantage of SIP, thereby improving demapping reliability and interference suppression. Furthermore, under the proposed mix scheme, a Vision Transformer-based neural receiver is designed to capture the orthogonal structure between pilots and perturbation-bearing data, as well as the underlying channel correlations, thereby relaxing the stringent quasi-static assumption required for interference disentanglement. Simulation results demonstrate that the proposed framework achieves significant performance gains in the low-to-medium SNR regime under time-varying channels while providing superior computational efficiency compared with state-of-the-art SIP receivers.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>A Fully Convolutional Approach to Denoising Structural Dynamics Data from X-Ray Photon Correlation Spectroscopy</title>
  <link>https://arxiv.org/abs/2605.29975</link>
  <pubDate>Fri, 29 May 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2605.29975v1 Announce Type: cross Abstract: We present a fully convolutional denoising autoencoder (FC-DAE) for denoising two-time intensity-intensity correlation functions ($C_2$) in X-ray photon correlation spectroscopy (XPCS). Unlike conventional denoising autoencoders that are typically restricted to fixed input sizes, the FC-DAE accepts inputs of arbitrary dimensions while preserving correlation structures across diverse dynamical regimes. The model is trained using experimentally derived $C_2$ data collected at NSLS-II beamlines, with data augmentation applied to expand the diversity of the dataset and reduce overfitting. The FC-DAE successfully recovers intricate dynamical features in low signal-to-noise conditions while maintaining structural fidelity. To assess reconstruction reliability, we employ quantitative metrics to evaluate structural fidelity and identify potential model-induced bias. Our results demonstrate that the FC-DAE provides robust denoising performance with high computational efficiency, enabling recovery of XPCS dynamics under photon-limited and low-dose measurement conditions.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>On the Effect of Pulse Shaping Filters in Zak-OTFS Waveform for Radar Sensing</title>
  <link>https://arxiv.org/abs/2605.29824</link>
  <pubDate>Fri, 29 May 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2605.29824v1 Announce Type: cross Abstract: In radar sensing, the self-ambiguity function of the probing waveform plays a crucial role in the resolvability and detection of multiple targets. In the recent Zak-OTFS based radar literature, Gaussian pulse shaping filter has been considered, and it has been shown to offer better range/velocity estimation performance compared to the traditional chirp waveform in scenes with multiple targets. While the self-ambiguity function with Gaussian filter has very low side lobes, its main lobe is wide which compromises resolvability and performance. Motivated by this, we seek filters with better ambiguity characteristics. Specifically, we explore two other known filters, namely, sinc and Gaussian-sinc (GS) filters, and demonstrate that these filters offer better performance compared to Gaussian filter under different scenarios and receiver processing. Towards demonstrating this, we derive closed-form expressions for the self-ambiguity functions of Zak-OTFS waveform with sinc and GS filters. The ambiguity functions of sinc and GS filtered waveforms have narrow main lobes, resulting in better resolvability in scenes with densely populated targets for the basic peak-detection based receiver. The ambiguity function of Gaussian filtered waveform has very low sidelobes, resulting in better performance in sparsely populated scenes. When a receiver with inter-target interference mitigation is used, the sinc and GS filters perform better in both dense and sparsely populated scenes compared to Gaussian filter.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>A Unified Two-Stage Generative Diffusion Framework for Channel Estimation and Port Selection in Multiuser MIMO-FAS</title>
  <link>https://arxiv.org/abs/2605.29679</link>
  <pubDate>Fri, 29 May 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2605.29679v1 Announce Type: cross Abstract: Fluid antenna systems (FAS) have emerged as a promising technology for next-generation wireless systems. However, practical multiuser multiple-input multiple-output FAS (MIMO-FAS) faces two inherently coupled challenges: acquiring accurate high-dimensional channel state information (CSI) from limited RF chains and solving the combinatorial port selection problem, where the effectiveness of the latter highly depends on the result of the former. In this paper, we propose a unified two-stage diffusion framework that formulates the joint task as a maximum-a-posteriori (MAP) inference problem and decomposes it into two sequential sampling stages through a plug-in approximation. For Stage I, a continuous flow-based diffusion model serves as a powerful implicit prior for 2D FAS channels, and a parallel guided generation scheme realizes approximate posterior sampling, enabling accurate multiuser channel recovery even under severely low sub-sampling ratios. For Stage II, a discrete diffusion model is trained to approximate the conditional port selection distribution by combining supervised learning on heuristic labels with reinforcement fine-tuning, effectively overcoming the local optima of conventional heuristic algorithms. Extensive simulations demonstrate that the proposed framework simultaneously achieves exceptional channel estimation accuracy and globally optimized port selection, substantially improving the minimum achievable rate.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Embodied Virtual Reality Feedback Reshapes Neural Representations to Support Continuous Three-Dimensional Motor Imagery Decoding</title>
  <link>https://arxiv.org/abs/2605.29677</link>
  <pubDate>Fri, 29 May 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2605.29677v1 Announce Type: cross Abstract: Continuous brain-computer interfaces (BCIs) that decode motion trajectories from imagined movement offer intuitive motor control, yet how feedback modality and longitudinal training shape neural representations and decoding performance remains poorly understood. We present the first systematic investigation of embodied virtual reality (VR) feedback during real-time 3D virtual limb control driven by motor imagery, across ten longitudinal sessions in ten participants. Performance was evaluated using three strategies: actual online performance (Fixed Decoder Generalisation, FDG), periodic retraining (Sequential Adaptive Training, SAT), and within-session upper-bound estimation (Within-Session Reconstruction, WSR). A CNN-LSTM decoder achieved within-session imagined movement correlations of r = 0.762 under VR and r = 0.672 under screen feedback. VR significantly outperformed screen feedback across all strategies and movement dimensions (improvements of 8.9-13.0%, all p &lt;= 0.002, d = 1.42-2.05). This advantage persisted under fixed decoders without retraining, demonstrating that embodied VR feedback elicits inherently more decodable and generalisable neural representations. Linear mixed-effects modelling confirmed robust main effects of feedback modality and movement axis with no interaction. Neurophysiologically, VR produced stronger sensorimotor-parietal desynchronisation and enhanced motor-frontal functional connectivity, with pervasive anterior insula engagement across all frequency bands and increased superior parietal lobule coupling, paralleling patterns associated with real movement execution. These findings establish embodied spatial feedback as a key design principle for next-generation continuous BCIs targeting intuitive motor control and neurorehabilitation.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Uni-RCM: Unified Reference-guided Cross-modal Mapping for Multi-Class Anomaly Detection</title>
  <link>https://arxiv.org/abs/2605.29455</link>
  <pubDate>Fri, 29 May 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2605.29455v1 Announce Type: cross Abstract: Multi-modal industrial anomaly detection typically relies on separate models for each product category, fundamentally limiting practical scalability. When shifting to a unified paradigm that handles diverse classes simultaneously, detection accuracy often degrades due to inter-class interference and feature manifold confusion. To overcome these challenges, we propose a Unified Reference-guided Cross-modal Mapping framework, named Uni-RCM. At its core is a reference guide block that dynamically filters out category-specific noise by introducing a learnable reference feature, which captures the commonalities across different modalities. In addition, an offline residual quantizer is proposed to characterize the distribution of normal samples using multiple cascaded codebooks. Extensive evaluations on the MVTec-3D AD dataset demonstrate state-of-the-art performance in the challenging multi-class setting, for both image-level detection and pixel-level localization.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
<item>
  <title>Manifold-based Algorithms for the Hadamard Decomposition</title>
  <link>https://arxiv.org/abs/2605.28980</link>
  <pubDate>Fri, 29 May 2026 00:00:00 -0400</pubDate>
  <description>arXiv:2605.28980v1 Announce Type: cross Abstract: Given a matrix $X$, and two ranks $r_1$ and $r_2$, the Hadamard decomposition (HD) looks for two low-rank matrices, $X_1$ of rank $r_1$ and $X_2$ of rank $r_2$, both of the same size as $X$, such that $X\approx X_1\circ X_2$, where $\circ$ is the Hadamard (element-wise) product. In most cases, HD is more expressive than standard low-rank approximations such as the truncated singular value decomposition (TSVD), as it can represent higher-rank matrices with the same number of parameters; this is because the rank of $X_1 \circ X_2$ is generically equal to $r_1 r_2$. In this paper, we first present some theoretical insights for HD, in particular a useful reformulation $X\approx WH^\top$ where $W$ and $H$ have $r_1 r_2$ columns and belong to certain manifolds. These allow us to develop three new algorithms for computing HD. The first one uses the representation $X\approx X_1\circ X_2$ and relies on the Manopt toolbox. The other two rely on the reformulation $X\approx WH^\top$: one is a block projected gradient method, and the other is a manifold-based gradient descent algorithm that does not require projection onto the feasible set. The last two algorithms are particularly effective for handling large sparse data. We also propose new initializations that allow us to improve the accuracy of the HD. We compare our algorithms and initialization strategies with the TSVD and with the state of the art. Numerical results show that the new methods are efficient and competitive on both synthetic and real data.</description>
  <dc:source>Systems/eess.SP_(Signal_Processing)</dc:source>
</item>
</channel>
</rss>
