YouZum

AI

AI, Committee, News, Uncategorized

Can structural correspondences ground real world representational content in Large Language Models?

arXiv:2506.16370v1 Announce Type: new Abstract: Large Language Models (LLMs) such as GPT-4 produce compelling responses to a wide range of prompts. But their representational capacities are uncertain. Many LLMs have no direct contact with extra-linguistic reality: their inputs, outputs and training data consist solely of text, raising the questions (1) can LLMs represent anything and (2) if so, what? In this paper, I explore what it would take to answer these questions according to a structural-correspondence based account of representation, and make an initial survey of this evidence. I argue that the mere existence of structural correspondences between LLMs and worldly entities is insufficient to ground representation of those entities. However, if these structural correspondences play an appropriate role – they are exploited in a way that explains successful task performance – then they could ground real world contents. This requires overcoming a challenge: the text-boundedness of LLMs appears, on the face of it, to prevent them engaging in the right sorts of tasks.


Techniques for supercharging academic writing with generative AI

arXiv:2310.17143v4 Announce Type: replace-cross Abstract: Academic writing is an indispensable yet laborious part of the research enterprise. This Perspective maps out principles and methods for using generative artificial intelligence (AI), specifically large language models (LLMs), to elevate the quality and efficiency of academic writing. We introduce a human-AI collaborative framework that delineates the rationale (why), process (how), and nature (what) of AI engagement in writing. The framework pinpoints both short-term and long-term reasons for engagement and their underlying mechanisms (e.g., cognitive offloading and imaginative stimulation). It reveals the role of AI throughout the writing process, conceptualized through a two-stage model for human-AI collaborative writing, and the nature of AI assistance in writing, represented through a model of writing-assistance types and levels. Building on this framework, we describe effective prompting techniques for incorporating AI into the writing routine (outlining, drafting, and editing) as well as strategies for maintaining rigorous scholarship, adhering to varied journal policies, and avoiding overreliance on AI. Ultimately, the prudent integration of AI into academic writing can ease the communication burden, empower authors, accelerate discovery, and promote diversity in science.


InstructTTSEval: Benchmarking Complex Natural-Language Instruction Following in Text-to-Speech Systems

arXiv:2506.16381v1 Announce Type: new Abstract: In modern speech synthesis, paralinguistic information, such as a speaker’s vocal timbre, emotional state, and dynamic prosody, plays a critical role in conveying nuance beyond mere semantics. Traditional Text-to-Speech (TTS) systems rely on fixed style labels or on inserting a speech prompt to control these cues, which severely limits flexibility. Recent attempts seek to employ natural-language instructions to modulate paralinguistic features, substantially improving the generalization of instruction-driven TTS models. Although many TTS systems now support customized synthesis via textual description, their actual ability to interpret and execute complex instructions remains largely unexplored. In addition, there is still a shortage of high-quality benchmarks and automated evaluation metrics specifically designed for instruction-based TTS, which hinders accurate assessment and iterative optimization of these models. To address these limitations, we introduce InstructTTSEval, a benchmark for measuring the capability of complex natural-language style control. We introduce three tasks, namely Acoustic-Parameter Specification, Descriptive-Style Directive, and Role-Play, including English and Chinese subsets, each with 1k test cases (6k in total) paired with reference audio. We leverage Gemini as an automatic judge to assess the systems’ instruction-following abilities. Our evaluation of accessible instruction-following TTS systems highlights substantial room for further improvement. We anticipate that InstructTTSEval will drive progress toward more powerful, flexible, and accurate instruction-following TTS.


GeoGuess: Multimodal Reasoning based on Hierarchy of Visual Information in Street View

arXiv:2506.16633v1 Announce Type: new Abstract: Multimodal reasoning is the process of understanding, integrating, and inferring information across different data modalities. It has recently attracted surging academic attention as a benchmark for Artificial Intelligence (AI). Although various tasks exist for evaluating multimodal reasoning ability, they still have limitations. In particular, reasoning over hierarchical visual clues at different levels of granularity, e.g., local details and global context, has received little discussion despite its frequent involvement in real scenarios. To bridge the gap, we introduce a novel and challenging task for multimodal reasoning, namely GeoGuess. Given a street view image, the task is to identify its location and provide a detailed explanation. A system that succeeds in GeoGuess should be able to detect tiny visual clues, perceive the broader landscape, and associate them with vast geographic knowledge. GeoGuess therefore requires the ability to reason between hierarchical visual information and geographic knowledge. In this work, we establish a benchmark for GeoGuess by introducing a specially curated dataset, GeoExplain, which consists of panorama-geocoordinate-explanation tuples. Additionally, we present a multimodal and multilevel reasoning method, namely SightSense, which can make predictions and generate comprehensive explanations based on the hierarchy of visual information and external knowledge. Our analysis and experiments demonstrate their outstanding performance in GeoGuess.


See stunning first images from the Vera C. Rubin Observatory

The first spectacular images taken by the Vera C. Rubin Observatory have been released for the world to peruse: a panoply of iridescent galaxies and shimmering nebulas. “This is the dawn of the Rubin Observatory,” says Meg Schwamb, a planetary scientist and astronomer at Queen’s University Belfast in Northern Ireland.

Much has been written about the observatory’s grand promise: to revolutionize our understanding of the cosmos by revealing a once-hidden population of far-flung galaxies, erupting stars, interstellar objects, and elusive planets. And thanks to its unparalleled technical prowess, few doubted its ability to make good on that. But over the past decade, during its lengthy construction period, “everything’s been in the abstract,” says Schwamb.

Today, that promise has become a staggeringly beautiful reality. Rubin’s view of the universe is unlike any that preceded it: an expansive vision of the night sky replete with detail, including hazy envelopes of matter coursing around galaxies and star-paved bridges arching between them. “These images are truly stunning,” says Pedro Bernardinelli, an astronomer at the University of Washington.

During its brief perusal of the night sky, Rubin even managed to spy more than 2,000 never-before-seen asteroids, demonstrating that it should be able to spotlight even the sneakiest denizens, and darkest corners, of our own solar system.

Today’s reveal is a mere amuse-bouche compared with what’s to come: Rubin, funded by the US National Science Foundation and the Department of Energy, is set for at least 10 years of planned observations. But this moment, and these glorious inaugural images, are worth celebrating for what they represent: the culmination of over a decade of painstaking work.

“This is a direct demonstration that Rubin is no longer in the future,” says Bernardinelli. “It’s the present.”

The observatory is named after the late Vera Rubin, an astronomer who uncovered strong evidence for dark matter, a mysterious and as-yet-undetected something that’s binding galaxies together more strongly than the gravity of ordinary, visible matter alone can explain. Trying to make sense of dark matter (and its equally mysterious, universe-stretching cousin, dubbed dark energy) is a monumental task, one that cannot be addressed by just one line of study or scrutiny of one type of cosmic object.

That’s why Rubin was designed to document anything and everything that shifts or sparkles in the night sky. Sitting atop Chile’s Cerro Pachón mountain range, it boasts a 7,000-pound, 3,200-megapixel digital camera that can take detailed snapshots of a large patch of the night sky; a house-size cradle of mirrors that can drink up extremely distant and faint starlight; and a maze of joints and pistons that allow it to swivel about with incredible speed and precision. A multinational computer network permits its sky surveys to be largely automated, its images speedily processed, any new objects easily detected, and the relevant groups of astronomers quickly alerted.

All that technical wizardry allows Rubin to take a picture of the entire visible night sky once every few days, filling in the shadowed gaps and unseen activity between galaxies. “The sky [isn’t] static. There are asteroids zipping by, and supernovas exploding,” says Yusra AlSayyad, Rubin’s overseer of image processing. By conducting a continuous survey over the next decade, the facility will create a three-dimensional movie of the universe’s ever-changing chaos that could help address all sorts of astronomic queries. What were the very first galaxies like? How did the Milky Way form? Are there planets hidden in our own solar system’s backyard?

Rubin’s first glimpse of the firmament is predictably bursting with galaxies and stars. But the resolution, breadth, and depth of the images have taken astronomers aback. “I’m very impressed with these images. They’re really incredible,” says Christopher Conselice, an extragalactic astronomer at the University of Manchester in England.

One shot, created from 678 individual exposures, showcases the Trifid and Lagoon nebulas, two oceans of luminescent gas and dust where stars are born. Others depict a tiny portion of Rubin’s view of the Virgo Cluster, a zoo of galaxies. Hues of blue are coming from relatively nearby whirlpools of stars, while red tints emanate from remarkably distant and primeval galaxies.

Image: A small section of the Vera C. Rubin Observatory’s view of the Virgo Cluster. Three merging galaxies can be seen on the upper right. The view also includes two striking spiral galaxies (lower right), distant galaxies, and many Milky Way stars. (Credit: NSF-DOE Vera C. Rubin Observatory)

The rich detail in these images is already proving to be illuminating. “As galaxies merge and interact, the galaxies are pulling stars away from each other,” says Conselice. This behavior can be seen in plumes of diffuse light erupting from several galaxies, creating halos around them or illuminated bridges between them, records of these ancient galaxies’ pasts.

Images like these are also likely to contain several supernovas, the explosive final moments of sizable stars. Not only do supernovas seed the cosmos with all the heavy elements that planets, and life, rely on, but they can also hint at how the universe has expanded over time.

Anais Möller, an astrophysicist at the Swinburne University of Technology in Melbourne, Australia, is a supernova hunter. “I search for exploding stars in very far away galaxies,” she says. Older sky surveys have found plenty, but they can lack context: You can see the explosion, but not what galaxy it’s from. Thanks to Rubin’s resolution (amply demonstrated by the Virgo Cluster set of images), astronomers can now “find where those exploding stars live,” says Möller.

Image: Another small section of the observatory’s view of the Virgo Cluster. The image includes many distant galaxies along with stars from our own Milky Way galaxy. (Credit: NSF-DOE Vera C. Rubin Observatory)

While taking these images of the distant universe, Rubin also discovered 2,104 asteroids flitting about in our own solar system, including seven whose orbits hew close to Earth’s own. This number may sound impressive, but it’s just par for the course for Rubin. In just a few months, it will find over a million new asteroids, doubling the current known tally. And over the course of its decadal survey, Rubin is


Why Generalization in Flow Matching Models Comes from Approximation, Not Stochasticity

Introduction: Understanding Generalization in Deep Generative Models

Deep generative models, including diffusion and flow matching, have shown outstanding performance in synthesizing realistic multi-modal content across images, audio, video, and text. However, their generalization capabilities and underlying mechanisms remain poorly understood. The core challenge is determining whether generative models truly generalize or simply memorize training data. Current research reveals conflicting evidence: some studies show that large diffusion models memorize individual samples from training sets, while others show clear signs of generalization when trained on large datasets. This contradiction points to a sharp phase transition between memorization and generalization.

Existing Literature on Flow Matching and Generalization Mechanisms

Existing research includes the use of closed-form solutions, studies of memorization versus generalization, and characterizations of different phases of the generating dynamics. Methods such as closed-form velocity field regression and a smoothed version of optimal velocity generation have been proposed. Studies on memorization relate the transition to generalization to training dataset size through geometric interpretations, while others focus on stochasticity in target objectives. Temporal regime analysis identifies distinct phases in generative dynamics, which depend on dimension and the number of samples. However, these validation methods depend on backward process stochasticity, which does not apply to flow matching models, leaving significant gaps in understanding.

New Findings: Early Trajectory Failures Drive Generalization

Researchers from Université Jean Monnet Saint-Etienne and Université Claude Bernard Lyon address whether training on noisy or stochastic targets improves flow matching generalization, and identify the main sources of generalization. Their analysis reveals that generalization emerges when limited-capacity neural networks fail to approximate the exact velocity field during critical time intervals at early and late phases. The researchers find that generalization arises mainly early along flow matching trajectories, corresponding to the transition from stochastic to deterministic behaviour. Moreover, they propose a learning algorithm that explicitly regresses against the exact velocity field, showing enhanced generalization capabilities on standard image datasets.

Investigating the Sources of Generalization in Flow Matching

The researchers investigate the key sources of generalization in three steps. First, they challenge target stochasticity assumptions by using closed-form optimal velocity field formulations, showing that after small time values, the weighted average of conditional flow matching targets equals single expectation values. Second, they analyze the approximation quality between learned velocity fields and optimal velocity fields through systematic experiments on subsampled CIFAR-10 datasets ranging from 10 to 10,000 samples. Third, they construct hybrid models using piecewise trajectories governed by optimal velocity fields for early time intervals and learned velocity fields for later intervals, with adjustable threshold parameters to determine critical periods.

Empirical Flow Matching: A Learning Algorithm for Deterministic Targets

The researchers implement a learning algorithm that regresses against more deterministic targets using closed-form formulas. They compare vanilla conditional flow matching, optimal transport flow matching, and empirical flow matching across the CIFAR-10 and CelebA datasets, using multiple samples to estimate empirical means. Evaluation metrics include Fréchet Inception Distance with Inception-V3 and DINOv2 embeddings for a less biased assessment. The computation has complexity O(M × |B| × d), where M is the number of samples used to compute the empirical mean. Training configurations demonstrate that increasing M creates less stochastic targets, leading to more stable performance improvements with modest computational overhead when M equals the batch size.

Conclusion: Velocity Field Approximation as the Core of Generalization

In this paper, the researchers challenge the assumption that stochasticity in loss functions drives generalization in flow matching models, clarifying the critical role of exact velocity field approximation instead. While the research provides empirical insights into practical learned models, precise characterization of learned velocity fields outside optimal trajectories remains an open challenge, suggesting future work that uses architectural inductive biases. The broader implications include concerns about potential misuse of improved generative models for creating deepfakes, privacy violations, and synthetic content generation, so ethical applications deserve careful consideration.

Why This Research Matters

This research is significant because it challenges a prevailing assumption in generative modeling: that stochasticity in training objectives is a key driver of generalization in flow matching models. By demonstrating that generalization instead arises from the failure of neural networks to precisely approximate the closed-form velocity field, especially during early trajectory phases, the study reframes our understanding of what enables models to produce novel data. This insight has direct implications for designing more efficient and interpretable generative systems, reducing computational overhead while maintaining or even enhancing generalization. It also informs better training protocols that avoid unnecessary stochasticity, improving reliability and reproducibility in real-world applications.

The post Why Generalization in Flow Matching Models Comes from Approximation, Not Stochasticity appeared first on MarkTechPost.
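The closed-form velocity field mentioned in this summary can be illustrated with a small sketch. It assumes the common linear path x_t = (1 - t) x_0 + t x_1 with a standard Gaussian source x_0 and a finite training set; under those assumptions the optimal velocity is a softmax-weighted average of the per-sample targets (x_i - x) / (1 - t). The function name `optimal_velocity` and the toy data are hypothetical, and this is not claimed to be the paper's exact formulation.

```python
import math


def optimal_velocity(x, t, data):
    """Closed-form velocity at point x and time t for a finite dataset.

    Assumes x_t = (1 - t) * x0 + t * x1 with x0 ~ N(0, I), so that
    p_t(x | x_i) = N(t * x_i, (1 - t)^2 I). Returns (velocity, weights).
    """
    if not 0.0 <= t < 1.0:
        raise ValueError("t must lie in [0, 1)")
    s = 1.0 - t
    # Log-weights of each training point, up to a constant shared by all points.
    logits = [
        -sum((xj - t * pj) ** 2 for xj, pj in zip(x, p)) / (2.0 * s * s)
        for p in data
    ]
    # Numerically stable softmax over the training points.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    weights = [e / z for e in exps]
    # Weighted average of the conditional targets (x_i - x) / (1 - t).
    velocity = [
        sum(w * (p[j] - x[j]) / s for w, p in zip(weights, data))
        for j in range(len(x))
    ]
    return velocity, weights


if __name__ == "__main__":
    toy_data = [[1.0, 0.0], [-1.0, 0.0]]  # hypothetical two-point dataset
    print(optimal_velocity([0.0, 0.0], 0.0, toy_data))
    print(optimal_velocity([0.9, 0.0], 0.9, toy_data))
```

At t = 0 every training point receives the same weight, so the target is an average over the whole dataset; at later times the weights concentrate on a single point, which matches the summary's remark that the weighted average collapses to a single value after small time values.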


Meta AI Researchers Introduced a Scalable Byte-Level Autoregressive U-Net Model That Outperforms Token-Based Transformers Across Language Modeling Benchmarks

Language modeling plays a foundational role in natural language processing, enabling machines to predict and generate text that resembles human language. These models have evolved significantly, beginning with statistical methods and progressing through neural architectures to today’s large-scale transformer-based systems. At the center of many applications, such as chatbots, translation tools, and text completion engines, language models interpret and generate sequences of words or bytes. Their effectiveness largely depends on the underlying architecture and the data representations used. As the demand for more efficient and scalable models grows, researchers continue to explore new structures and training methods to improve performance, handle longer contexts, and reduce computational load. Among these efforts, combining ideas from convolutional architectures with autoregressive prediction has emerged as an intriguing approach.

Challenges with Tokenization and Transformer-Based Language Models

One of the main issues in language modeling is the heavy reliance on token-based transformer models, which are computationally expensive and generally inefficient for processing at the byte level or across languages. Techniques such as Byte Pair Encoding control sequence lengths but create inconsistencies between languages and domains. Transformers, although accurate, scale poorly because of their quadratic complexity. Competing approaches, such as sparse attention, attempt to solve this issue, but they typically do so at the expense of simplicity or performance. Byte-level modeling with flat transformers has demonstrated only partial success, underscoring the need for new architectures that can process raw byte inputs without tokenization while still achieving excellent performance.

Introducing AU-Net: A Token-Free Byte-Level Language Model

Researchers from FAIR at Meta, TAU, INRIA, and LISN, CNRS & Université Paris-Saclay, INSA Rouen Normandy, LITIS, Rouen, France, introduced a new Autoregressive U-Net (AU-Net). This model integrates the ideas of convolutional U-Net designs with autoregressive decoding processes. In contrast to token-based transformer systems, AU-Net does not require tokenization and works directly on bytes. The architecture is designed to enable parallel and efficient generation while retaining autoregressive prediction. It achieves this by hierarchically encoding the sequence through down-sampling convolutions and then applying up-sampling stages, which restore the original sequence size. Notably, AU-Net introduces a splitting mechanism that enables predictions to be performed over subsegments of the sequence, enhancing scalability. This design shift also ensures that the model’s complexity increases linearly with sequence length, rather than quadratically. The researchers evaluated this model across several language modeling benchmarks and multilingual tasks to test its effectiveness in both low-resource and large-scale settings.

AU-Net Architecture: Multi-Scale Encoding and Parallel Inference

The AU-Net architecture is implemented with multiple scale stages that reduce and then reconstruct input sequences using strided convolutions. During training, each segment of the input sequence is predicted in a masked fashion to maintain the autoregressive property. The model uses a learned splitting function to divide input sequences into non-overlapping groups, which are then predicted concurrently and combined into a full output. It supports both shallow and deep configurations, with models ranging from 3% to 75% of the training compute budget compared to standard baselines. For example, one configuration trained on 200B tokens with 8 billion parameters achieved highly competitive results. Another version, trained on 60 billion tokens with a one-billion-parameter model, achieved a 35.7 BLEU score on standard translation tasks, outperforming baseline models trained on the same data. Additionally, AU-Net demonstrated faster generation speeds due to its parallel decoding, offering a significant benefit for latency-sensitive applications.

Benchmark Results Show Competitive Edge Over Transformers

The experimental results showed strong performance across a wide range of tasks. On Enwik8, a byte-level compression benchmark, AU-Net achieved 1.01 bits per byte, narrowly ahead of a transformer baseline at 1.02 bits per byte. On PG-19, a long-context language modeling task, the model achieved 2.61 bits per byte compared to 2.75 from standard transformers. AU-Net also scaled effectively across compute budgets, achieving 43.3 BLEU on FLORES-200 translation with an 8B model size trained on 200B tokens. In multilingual evaluation using FLORES-200, the model outperformed token-based transformers across low-resource language pairs. It also demonstrated better cross-lingual generalization within language families, achieving a BLEU score of up to 33.0 in several configurations. When evaluated under equal compute and data budgets, AU-Net either matched or outperformed transformers, with generation speeds improving by 20% to 30% in certain settings.

Key Contributions and Performance Insights from AU-Net

- AU-Net eliminates the need for tokenization by operating directly on raw byte inputs.
- On Enwik8, AU-Net scored 1.01 bpb, compared with 1.02 bpb for transformer baselines.
- On PG-19, it achieved 2.61 bpb, improving over the 2.75 bpb of standard transformers.
- FLORES-200 multilingual evaluation showed up to 33.0 BLEU, outperforming token-based systems.
- Byte-level models trained with AU-Net maintained high performance across high-resource and low-resource settings.
- Generation speed improved by 20% to 30%, supporting fast, parallel inference.
- Scaling laws held: performance improved with increased model size and data.
- The model showed better cross-lingual generalization and robustness to noise.
- Compute was used efficiently: AU-Net matched or exceeded transformer performance at lower compute budgets.
- AU-Net is a viable alternative for large-scale language modeling tasks, including multilingual and byte-level applications.

Conclusion: AU-Net’s Practical Benefits and Scalability Potential

In conclusion, the researchers provided detailed scaling analyses showing that AU-Net adheres to predictable hyperparameter scaling laws. It benefits from increased model size and training tokens in a manner consistent with what is observed in transformer models. For example, under compute-matched training settings, AU-Net’s performance improved steadily with increased data-to-model ratio, matching the gains seen in transformer counterparts. Importantly, AU-Net was able to scale up to models with 8 billion parameters, demonstrating effective training and showing that the architecture is capable of supporting high-capacity systems. In extended evaluations, the model maintained its efficiency when applied to downstream tasks, showing strong performance in language generation, translation, and byte-level prediction benchmarks. AU-Net also proved to be easier to train and more robust under noisy input conditions compared to token-based models.

Why This Research Matters

This research matters because it challenges the long-standing reliance on token-based language models by introducing AU-Net, a byte-level autoregressive architecture that eliminates tokenization overhead while achieving competitive or superior performance. By processing raw
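The bits-per-byte (bpb) figures quoted for Enwik8 and PG-19 can be made concrete with a short sketch. It shows how a byte-level model sees text (as UTF-8 bytes, with no tokenizer) and how a summed negative log-likelihood in nats converts to bpb. The helper names `bits_per_byte` and `uniform_nll_nats` are hypothetical, and the uniform baseline is only a reference point, not a result from the paper.

```python
import math


def bits_per_byte(total_nll_nats, text):
    """Convert a summed negative log-likelihood (in nats) over a text to bits per byte."""
    n_bytes = len(text.encode("utf-8"))  # byte-level models count UTF-8 bytes, not tokens
    return total_nll_nats / (math.log(2) * n_bytes)


def uniform_nll_nats(text):
    """NLL of a reference model that assigns probability 1/256 to every byte."""
    return len(text.encode("utf-8")) * math.log(256)


if __name__ == "__main__":
    sample = "héllo"  # 5 characters, 6 UTF-8 bytes
    print(list(sample.encode("utf-8")))
    print(bits_per_byte(uniform_nll_nats(sample), sample))
```

A model that guesses uniformly over all 256 byte values scores 8 bpb, so the reported values near 1 to 3 bpb mean the model needs far fewer bits than that baseline to encode each byte.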


Cloud quantum computing: A trillion-dollar opportunity with dangerous hidden risks

GUEST: Quantum computing (QC) brings with it a mix of groundbreaking possibilities and significant risks. Major tech players like IBM, Google, Microsoft and Amazon have already rolled out commercial QC cloud services, while specialized firms like Quantinuum and PsiQuantum have quickly achieved unicorn status. Experts predict that the global QC mark…


This AI Paper Introduces WINGS: A Dual-Learner Architecture to Prevent Text-Only Forgetting in Multimodal Large Language Models

Multimodal LLMs: Expanding Capabilities Across Text and Vision

Expanding large language models (LLMs) to handle multiple modalities, particularly images and text, has enabled the development of more interactive and intuitive AI systems. Multimodal LLMs (MLLMs) can interpret visuals, answer questions about images, and engage in dialogues that include both text and pictures. Their ability to reason across visual and linguistic domains makes them increasingly valuable for applications such as education, content generation, and interactive assistants.

The Challenge of Text-Only Forgetting in MLLMs

However, integrating vision into LLMs creates a problem. When trained on datasets that mix images with text, MLLMs often lose their ability to handle purely textual tasks. This phenomenon, known as text-only forgetting, occurs because visual tokens inserted into the language sequence divert the model’s attention away from the text. As a result, the MLLM starts prioritizing image-related content and performs poorly on tasks that require only language understanding, such as basic reasoning, comprehension, or textual question-and-answer (Q&A) tasks.

Limitations of Existing Mitigation Strategies

Several methods attempt to address this degradation. Some approaches reintroduce large amounts of text-only data during training, while others alternate between text-only and multimodal fine-tuning. These strategies aim to remind the model of its original language capabilities. Other designs include adapter layers or prompt-based tuning. However, these techniques often increase training costs, require complex switching logic during inference, or fail to restore text comprehension entirely. The problem largely stems from how the model’s attention shifts when image tokens are introduced into the sequence.

Introducing WINGS: A Dual-Learner Approach by Alibaba and Nanjing University

Researchers from Alibaba Group’s AI Business team and Nanjing University have introduced a new approach called WINGS. The design adds two new modules, a visual learner and a textual learner, to each layer of the MLLM. These learners work in parallel with the model’s core attention mechanism. The structure resembles “wings” attached to either side of the attention layers. A routing component controls how much weight each learner receives based on the current token mix, allowing the model to balance its focus between visual and textual information dynamically.

Low-Rank Residual Attention (LoRRA): Balancing Efficiency and Modality Awareness

The WINGS architecture uses a mechanism called Low-Rank Residual Attention (LoRRA), which keeps computations lightweight while enabling the learners to capture essential modality-specific information. In the first stage of training, only visual learners are activated to align image features. In the second stage, both visual and textual learners are co-trained with a router module that uses attention weights to allocate responsibility. Each learner uses efficient attention blocks to interact with either the image or the surrounding text, and their outputs are combined with those of the main model. This ensures that visual attention doesn’t overwhelm textual understanding.

WINGS Performance Benchmarks Across Text and Multimodal Tasks

In terms of performance, WINGS showed strong results. On the MMLU dataset, it achieved a text-only score of 60.53, representing an improvement of 9.70 points compared to a similar baseline model. For CMMLU, it scored 69.82, which is 9.36 points higher than the baseline. In reasoning tasks like Race-High, it gained 11.9 points, and in WSC, an improvement of 11.12 points was recorded. In multimodal benchmarks like MMMU-VAL, WINGS achieved an improvement of 4.78 points. It also demonstrated robust results on the IIT benchmark, handling mixed text-and-image multi-turn dialogues more effectively than other open-source MLLMs at the same scale.

Conclusion: Toward More Balanced and Generalizable MLLMs

In summary, the researchers tackled the issue of catastrophic text-only forgetting in MLLMs by introducing WINGS, an architecture that pairs dedicated visual and textual learners with attention routing. By analyzing attention shifts and designing targeted interventions, they maintained text performance while enhancing visual understanding, offering a more balanced and efficient multimodal model.

The post This AI Paper Introduces WINGS: A Dual-Learner Architecture to Prevent Text-Only Forgetting in Multimodal Large Language Models appeared first on MarkTechPost.
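The routing idea described in this summary can be sketched in a few lines: two side learners produce residual outputs, a router turns two scores into weights, and the weighted residuals are added to the main branch. Everything here is a simplified, hypothetical illustration: the names `low_rank_project` and `combine`, the softmax router, and the rank-r projection are assumptions for exposition and are not the paper's actual LoRRA formulation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a short list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def low_rank_project(down, up, vector):
    """Apply a rank-r map up @ (down @ vector), i.e. d -> r -> d with r much smaller than d."""
    return matvec(up, matvec(down, vector))


def combine(main_out, visual_out, textual_out, router_scores):
    """Add router-weighted learner residuals to the main attention output."""
    w_visual, w_textual = softmax(router_scores)
    return [
        m + w_visual * v + w_textual * t
        for m, v, t in zip(main_out, visual_out, textual_out)
    ]


if __name__ == "__main__":
    # Hypothetical toy values: equal router scores split the weight evenly.
    print(combine([1.0, 1.0], [2.0, 0.0], [0.0, 2.0], [0.0, 0.0]))
    print(low_rank_project([[1.0, 1.0]], [[1.0], [2.0]], [3.0, 4.0]))
```

Because the learners only add a residual to the main output, a router that shifts weight toward the textual learner can counteract the pull of image tokens without replacing the base model's attention.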


AI, Committee, News, Uncategorized

Mistral AI Releases Mistral Small 3.2: Enhanced Instruction Following, Reduced Repetition, and Stronger Function Calling for AI Integration

With the frequent release of new large language models (LLMs), developers continue working to reduce repetitive errors, improve robustness, and make user interactions smoother. As AI models take on more sophisticated computational tasks, their capabilities are being refined for integration into diverse, real-world scenarios.

Mistral AI has released Mistral Small 3.2 (Mistral-Small-3.2-24B-Instruct-2506), an updated version of its earlier release, Mistral-Small-3.1-24B-Instruct-2503. Although a minor release, Mistral Small 3.2 introduces targeted upgrades that aim to enhance the model’s overall reliability and efficiency, particularly in handling complex instructions, avoiding redundant outputs, and maintaining stability under function-calling scenarios.

A significant enhancement in Mistral Small 3.2 is its accuracy in following precise instructions. Benchmark scores reflect this improvement: on the Wildbench v2 instruction test, Mistral Small 3.2 achieved 65.33% accuracy, up from 55.6% for its predecessor. Likewise, performance on the more difficult Arena Hard v2 test more than doubled, from 19.56% to 43.1%, which is evidence of its improved ability to understand and execute intricate commands.

Mistral Small 3.2 also reduces instances of infinite or repetitive output, a problem commonly faced in long conversational scenarios. Internal evaluations show that the rate of infinite generations fell from 2.11% in Small 3.1 to 1.29% in Small 3.2, a relative reduction of roughly 39%. This reduction directly increases the model’s usability and dependability in extended interactions.

The new model also demonstrates greater capability to call functions, making it well suited to automation tasks. Improved robustness in the function calling template translates to more stable and dependable interactions.

STEM-related benchmarks improved as well. For example, HumanEval Plus Pass@5 accuracy rose from 88.99% in Small 3.1 to 92.90%. MMLU Pro results increased from 66.76% to 69.06%, and GPQA Diamond improved slightly from 45.96% to 46.13%, showing general competence in scientific and technical uses.

Vision-based results were mixed. ChartQA accuracy improved from 86.24% to 87.4%, and DocVQA improved marginally from 94.08% to 94.86%. In contrast, some tests, such as MMMU and Mathvista, showed slight dips, indicating trade-offs made during optimization.

The key updates in Mistral Small 3.2 over Small 3.1 include:

Enhanced precision in instruction-following, with Wildbench v2 accuracy rising from 55.6% to 65.33%.
Reduced repetition errors, with infinite generation instances falling from 2.11% to 1.29%.
Improved robustness in function calling templates, ensuring more stable integrations.
Notable increases in STEM-related performance, particularly in HumanEval Plus Pass@5 (92.90%) and MMLU Pro (69.06%).

In conclusion, Mistral Small 3.2 offers targeted and practical enhancements over its predecessor, providing users with greater accuracy, reduced redundancy, and improved integration capabilities. These advancements help position it as a reliable choice for complex AI-driven tasks across diverse application areas.

Check out the Model Card on Hugging Face. All credit for this research goes to the researchers of this project.

The post Mistral AI Releases Mistral Small 3.2: Enhanced Instruction Following, Reduced Repetition, and Stronger Function Calling for AI Integration appeared first on MarkTechPost.
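
The benchmark figures quoted in the Mistral Small 3.2 article are easier to compare when both the absolute gain (in percentage points) and the relative change are computed side by side. The following is a small sketch that does this arithmetic using only the before/after values stated in the article. The helper names `point_gain` and `relative_change` are hypothetical and exist only for this illustration.

```python
def point_gain(before, after):
    """Absolute change in percentage points, rounded to two decimals."""
    return round(after - before, 2)


def relative_change(before, after):
    """Change relative to the starting value, in percent, rounded to one decimal."""
    return round((after - before) / before * 100, 1)


# (Small 3.1, Small 3.2) values as quoted in the article.
scores = {
    "Wildbench v2": (55.6, 65.33),
    "Arena Hard v2": (19.56, 43.1),
    "Infinite generations": (2.11, 1.29),  # lower is better
    "HumanEval Plus Pass@5": (88.99, 92.90),
    "MMLU Pro": (66.76, 69.06),
    "GPQA Diamond": (45.96, 46.13),
    "ChartQA": (86.24, 87.4),
    "DocVQA": (94.08, 94.86),
}

summary = {
    name: (point_gain(before, after), relative_change(before, after))
    for name, (before, after) in scores.items()
}

for name, (points, percent) in summary.items():
    print(f"{name}: {points:+.2f} points ({percent:+.1f}% relative)")
```

Separating the two measures matters for small base rates: the infinite-generation rate drops by less than one percentage point, yet that is a relative reduction of close to 39%, while Arena Hard v2 rises by more than 23 points, which is more than double its starting value.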

