Feed aggregator

A new generative AI approach to predicting chemical reactions

MIT Latest News - Wed, 09/03/2025 - 3:55pm

Many attempts have been made to harness the power of new artificial intelligence and large language models (LLMs) to try to predict the outcomes of new chemical reactions. These have had limited success, in part because until now they have not been grounded in an understanding of fundamental physical principles, such as the law of conservation of mass. Now, a team of researchers at MIT has come up with a way of imposing these physical constraints on a reaction prediction model, greatly improving the accuracy and reliability of its outputs.

The new work was reported Aug. 20 in the journal Nature, in a paper by recent postdoc Joonyoung Joung (now an assistant professor at Kookmin University, South Korea); former software engineer Mun Hong Fong (now at Duke University); chemical engineering graduate student Nicholas Casetti; postdoc Jordan Liles; physics undergraduate student Ne Dassanayake; and senior author Connor Coley, who is the Class of 1957 Career Development Professor in the MIT departments of Chemical Engineering and Electrical Engineering and Computer Science.

“The prediction of reaction outcomes is a very important task,” Joung explains. For example, if you want to make a new drug, “you need to know how to make it. So, this requires us to know what product is likely” to result from a given set of chemical inputs to a reaction. But most previous efforts at such predictions look only at a set of inputs and a set of outputs, without examining the intermediate steps or enforcing the constraint that no mass is gained or lost along the way, even though actual reactions can neither create nor destroy mass.

Joung points out that while large language models such as ChatGPT have been very successful in many areas of research, these models do not provide a way to limit their outputs to physically realistic possibilities, such as by requiring them to adhere to conservation of mass. These models use computational “tokens,” which in this case represent individual atoms, but “if you don’t conserve the tokens, the LLM model starts to make new atoms, or deletes atoms in the reaction.” Instead of being grounded in real scientific understanding, “this is kind of like alchemy,” he says. While many attempts at reaction prediction only look at the final products, “we want to track all the chemicals, and how the chemicals are transformed” throughout the reaction process from start to end, he says.

In order to address the problem, the team made use of a method developed back in the 1970s by chemist Ivar Ugi, which uses a bond-electron matrix to represent the electrons in a reaction. They used this system as the basis for their new program, called FlowER (Flow matching for Electron Redistribution), which allows them to explicitly keep track of all the electrons in the reaction to ensure that none are spuriously added or deleted in the process.

The system uses a matrix to represent the electrons in a reaction, and uses nonzero values to represent bonds or lone electron pairs and zeros to represent a lack thereof. “That helps us to conserve both atoms and electrons at the same time,” says Fong. This representation, he says, was one of the key elements to including mass conservation in their prediction system.
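As a concrete illustration of the representation Fong describes, here is a toy sketch of the Dugundji-Ugi bond-electron matrix for a single invented reaction step. This is illustrative only; it is not FlowER's code, and the molecule and numbers are chosen purely for the example:

```python
# Toy sketch of a Dugundji-Ugi bond-electron (BE) matrix, the 1970s
# formalism the article describes. Off-diagonal entry [i][j] is the bond
# order between atoms i and j; diagonal entry [i][i] is the number of
# unshared valence electrons on atom i. The sum of all entries equals
# the total valence electron count of the system.

# Atoms: 0 = O, 1 = H, 2 = H  (a water molecule)
reactant = [
    [4, 1, 1],  # O: two lone pairs (4 electrons), one bond to each H
    [1, 0, 0],  # H: one O-H bond, no lone electrons
    [1, 0, 0],  # H
]

# A reaction step is an electron-redistribution matrix whose entries sum
# to zero, so applying it can never create or delete electrons.
# Here: heterolytic cleavage of one O-H bond (both electrons go to O).
reaction = [
    [ 2, -1, 0],
    [-1,  0, 0],
    [ 0,  0, 0],
]

def total_electrons(be):
    """Total valence electrons represented by a BE matrix."""
    return sum(sum(row) for row in be)

product = [[reactant[i][j] + reaction[i][j] for j in range(3)]
           for i in range(3)]

print(total_electrons(reactant), total_electrons(product))  # 8 8
```

Because any valid redistribution matrix sums to zero, electron (and hence atom) conservation is built into the representation rather than hoped for, which is the property the researchers exploit.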

The system they developed is still at an early stage, Coley says. “The system as it stands is a demonstration — a proof of concept that this generative approach of flow matching is very well suited to the task of chemical reaction prediction.” While the team is excited about this promising approach, he says, “we’re aware that it does have specific limitations as far as the breadth of different chemistries that it’s seen.” Although the model was trained using data on more than a million chemical reactions, obtained from a U.S. Patent Office database, those data do not include certain metals and some kinds of catalytic reactions, he says.

“We’re incredibly excited about the fact that we can get such reliable predictions of chemical mechanisms” from the existing system, he says. “It conserves mass, it conserves electrons, but we certainly acknowledge that there’s a lot more expansion and robustness to work on in the coming years as well.”

But even in its present form, which is being made freely available through the online platform GitHub, “we think it will make accurate predictions and be helpful as a tool for assessing reactivity and mapping out reaction pathways,” Coley says. “If we’re looking toward the future of really advancing the state of the art of mechanistic understanding and helping to invent new reactions, we’re not quite there. But we hope this will be a steppingstone toward that.”

“It’s all open source,” says Fong. “The models, the data, all of them are up there,” including a previous dataset developed by Joung that exhaustively lists the mechanistic steps of known reactions. “I think we are one of the pioneering groups making this dataset, and making it available open-source, and making this usable for everyone,” he says.

The FlowER model matches or outperforms existing approaches in finding standard mechanistic pathways, the team says, and makes it possible to generalize to previously unseen reaction types. They say the model could potentially be relevant for predicting reactions for medicinal chemistry, materials discovery, combustion, atmospheric chemistry, and electrochemical systems.

In their comparisons with existing reaction prediction systems, Coley says, “using the architecture choices that we’ve made, we get this massive increase in validity and conservation, and we get a matching or a little bit better accuracy in terms of performance.”

He adds that “what’s unique about our approach is that while we are using these textbook understandings of mechanisms to generate this dataset, we’re anchoring the reactants and products of the overall reaction in experimentally validated data from the patent literature.” They are inferring the underlying mechanisms, he says, rather than just making them up. “We’re imputing them from experimental data, and that’s not something that has been done and shared at this kind of scale before.”

The next step, he says, is “we are quite interested in expanding the model’s understanding of metals and catalytic cycles. We’ve just scratched the surface in this first paper,” and most of the reactions included so far don’t include metals or catalysts, “so that’s a direction we’re quite interested in.”

In the long term, he says, “a lot of the excitement is in using this kind of system to help discover new complex reactions and help elucidate new mechanisms. I think that the long-term potential impact is big, but this is of course just a first step.”

The work was supported by the Machine Learning for Pharmaceutical Discovery and Synthesis consortium and the National Science Foundation.

EFF Statement on ICE Use of Paragon Solutions Malware

EFF: Updates - Wed, 09/03/2025 - 3:46pm

This statement can be attributed to EFF Senior Staff Technologist Cooper Quintin

It was recently reported by Jack Poulson on Substack that ICE has reactivated its $2 million contract with Paragon Solutions, a cyber-mercenary and spyware manufacturer.

The reactivation of the contract between the Department of Homeland Security and Paragon Solutions, a known spyware vendor, is extremely troubling.

Paragon's “Graphite” malware has been implicated in widespread misuse by the Italian government. Researchers at the Citizen Lab at the Munk School of Global Affairs at the University of Toronto, together with Meta, found that it has been used in Italy to spy on journalists and civil society actors, including humanitarian workers. Without strong legal guardrails, there is a risk that the malware will be misused in a similar manner by the U.S. Government.

These reports undermine Paragon Solutions' public marketing of itself as a more ethical provider of surveillance malware.

Reportedly, the contract is being reactivated because the US arm of Paragon Solutions was acquired by a Miami-based private equity firm, AE Industrial Partners, and then merged into a Virginia-based cybersecurity company, REDLattice, allowing ICE to circumvent Executive Order 14093, which bans the acquisition of spyware controlled by a foreign government or person. Even though this order was always insufficient to prevent the acquisition of dangerous spyware, it was the best protection we had. This end run around the executive order ignores the spirit of the rule and does nothing to prevent the misuse of Paragon's malware for human rights abuses. Nor will it prevent insider threats at Paragon from using their malware to spy on US government officials, or US government officials from misusing it to spy on their personal enemies, rivals, or spouses.


The contract between Paragon and ICE requires all US users to adjust their threat models and take extra precautions. Paragon's Graphite isn't magical; it's still just malware. It still needs a zero-day exploit to compromise a phone with the latest security updates, and those exploits are expensive. The best thing you can do to protect yourself against Graphite is to keep your phone up to date and enable Lockdown Mode if you are using an iPhone, or Advanced Protection Mode if you are using Android. Turning on disappearing messages also helps: if someone in your network does get compromised, you don't also reveal your entire message history. For more tips on protecting yourself from malware, check out our Surveillance Self-Defense guides.

Related Cases: AlHathloul v. DarkMatter Group

EFF Awards Spotlight ✨ Just Futures Law

EFF: Updates - Wed, 09/03/2025 - 2:04pm

In 1992 EFF presented our very first awards recognizing key leaders and organizations advancing innovation and championing civil liberties and human rights online. Now in 2025 we're continuing to celebrate the accomplishments of people working toward a better future for everyone with the EFF Awards!

All are invited to attend the EFF Awards on Wednesday, September 10 at the San Francisco Design Center. Whether you're an activist, an EFF supporter, a student interested in cyberlaw, or someone who wants to munch on a strolling dinner with other likeminded individuals, anyone can enjoy the ceremony!

REGISTER TODAY!

GENERAL ADMISSION: $55 | CURRENT EFF MEMBERS: $45 | STUDENTS: $35

If you're not able to make it, we'll also be hosting a livestream of the event on Friday, September 12 at 12:00 PM PT. The event will also be recorded, and posted to YouTube and the Internet Archive after the livestream.

We are honored to present the three winners of this year's EFF Awards: Just Futures Law, Erie Meyer, and Software Freedom Law Center, India. But, before we kick off the ceremony next week, let's take a closer look at each of the honorees. First up—Just Futures Law, winner of the EFF Award for Leading Immigration and Surveillance Litigation:

Just Futures Law is a women-of-color-led law project that recognizes how surveillance disproportionately impacts immigrants and people of color in the United States. In the past year, Just Futures sued the Department of Homeland Security and its subagencies, seeking a court order to compel the agencies to release records on their use of AI and other algorithms, and sued the Trump Administration for prematurely halting Haiti’s Temporary Protected Status, a humanitarian program that allows hundreds of thousands of Haitians to temporarily remain and work in the United States due to Haiti’s current conditions of extraordinary crisis. It has represented activists in their fight against tech giants like Clearview AI, worked with Mijente to launch the TakeBackTech fellowship to train new advocates in grassroots-directed research, and worked with Grassroots Leadership to fight for the release of individuals detained under Operation Lone Star.

We're excited to celebrate Just Futures Law and the other EFF Award winners in person in San Francisco on September 10! We hope that you'll join us there.

Thank you to Fastly, DuckDuckGo, Corellium, and No Starch Press for their year-round support of EFF's mission.

Want to show your team’s support for EFF? Sponsorships ensure we can continue hosting events like this to build community among digital rights supporters. Please visit eff.org/thanks or contact tierney@eff.org for more information on corporate giving and sponsorships.

EFF is dedicated to a harassment-free experience for everyone, and all participants are encouraged to view our full Event Expectations.

Questions? Email us at events@eff.org.

🤐 This Censorship Law Turns Parents Into Content Cops | EFFector 37.11

EFF: Updates - Wed, 09/03/2025 - 1:06pm

School is back in session! Perfect timing to hit the books and catch up on the latest digital rights news. We've got you covered with bite-sized updates in this issue of our EFFector newsletter.

This time, we're breaking down why Wyoming’s new age verification law is a free speech disaster. You’ll also read about a big win for transparency around police surveillance, how the Trump administration’s war on “woke AI” threatens civil liberties, and a welcome decision in a landmark human rights case.

Prefer to listen? Be sure to check out the audio companion to EFFector! We're interviewing EFF staff about some of the important issues they are working on. This time, EFF Legislative Activist Rindala Alajaji discusses the real harms of age verification laws like the one passed in Wyoming. Tune in on YouTube or the Internet Archive.

LISTEN TO EFFECTOR

EFFECTOR 37.11 - This Censorship Law Turns Parents Into Content Cops

Since 1990 EFF has published EFFector to help keep readers on the bleeding edge of their digital rights. We know that the intersection of technology, civil liberties, human rights, and the law can be complicated, so EFFector is a great way to stay on top of things. The newsletter is chock full of links to updates, announcements, blog posts, and other stories to help keep readers—and listeners—up to date on the movement to protect online privacy and free expression. 

Thank you to the supporters around the world who make our work possible! If you're not a member yet, join EFF today to help us fight for a brighter digital future.

Indirect Prompt Injection Attacks Against LLM Assistants

Schneier on Security - Wed, 09/03/2025 - 7:00am

Really good research on practical attacks against LLM agents.

Invitation Is All You Need! Promptware Attacks Against LLM-Powered Assistants in Production Are Practical and Dangerous

Abstract: The growing integration of LLMs into applications has introduced new security risks, notably known as Promptware — maliciously engineered prompts designed to manipulate LLMs to compromise the CIA triad of these applications. While prior research warned about a potential shift in the threat landscape for LLM-powered applications, the risk posed by Promptware is frequently perceived as low. In this paper, we investigate the risk Promptware poses to users of Gemini-powered assistants (web application, mobile application, and Google Assistant). We propose a novel Threat Analysis and Risk Assessment (TARA) framework to assess Promptware risks for end users. Our analysis focuses on a new variant of Promptware called Targeted Promptware Attacks, which leverage indirect prompt injection via common user interactions such as emails, calendar invitations, and shared documents. We demonstrate 14 attack scenarios applied against Gemini-powered assistants across five identified threat classes: Short-term Context Poisoning, Permanent Memory Poisoning, Tool Misuse, Automatic Agent Invocation, and Automatic App Invocation. These attacks highlight both digital and physical consequences, including spamming, phishing, disinformation campaigns, data exfiltration, unapproved user video streaming, and control of home automation devices. We reveal Promptware’s potential for on-device lateral movement, escaping the boundaries of the LLM-powered application, to trigger malicious actions using a device’s applications. Our TARA reveals that 73% of the analyzed threats pose High-Critical risk to end users. We discuss mitigations and reassess the risk (in response to deployed mitigations) and show that the risk could be reduced significantly to Very Low-Medium. We disclosed our findings to Google, which deployed dedicated mitigations...

Biden’s green bank is on the ropes

ClimateWire News - Wed, 09/03/2025 - 6:14am
A court ruling Tuesday could mean the end for a $20 billion fund that was meant to bring renewable energy to low-income communities.

NOAA creates powerful role for wind critic

ClimateWire News - Wed, 09/03/2025 - 6:14am
Anne Hawkins once led an activist group that challenged offshore wind projects. She now will serve as NOAA's chief of strategy.

Offshore wind is in a fight for survival

ClimateWire News - Wed, 09/03/2025 - 6:12am
The last two weeks have plunged the industry into crisis, as President Donald Trump cuts funding, revokes permits and stops projects.

Trump admin asks court to reverse New York climate superfund law

ClimateWire News - Wed, 09/03/2025 - 6:12am
The legal move marks a new phase of President Donald Trump's effort to exert control over state environmental statutes.

Cruise industry sues Hawaii over climate tax

ClimateWire News - Wed, 09/03/2025 - 6:11am
Cruise Lines International says the levy would impose “massive, novel, and unconstitutional fees” on passengers.

Trump cuts could weaken federal disaster response, GAO says

ClimateWire News - Wed, 09/03/2025 - 6:11am
FEMA has lost nearly 10 percent of its staff since the start of the year. The Army Corps of Engineers has seen a similar drop.

Scientists slam inaccuracies in DOE climate change report

ClimateWire News - Wed, 09/03/2025 - 6:10am
The group of 85 scientists also say the process of crafting the report seemed to run afoul of the Federal Advisory Committee Act.

France says EU leaders, not ministers, should decide 2040 climate target

ClimateWire News - Wed, 09/03/2025 - 6:09am
Getting the European Council involved in the bloc’s 2040 target would delay and complicate an agreement.

Zimbabwe publishes draft regulations to establish climate fund

ClimateWire News - Wed, 09/03/2025 - 6:09am
The fund will incentivize public and private entities to adopt clean energy and reduce greenhouse gas emissions.

Sudan seeks aid after landslide kills over 1,000 in single village

ClimateWire News - Wed, 09/03/2025 - 6:08am
The tragedy happened Sunday in the village, located in Central Darfur’s Marrah Mountains, after days of heavy rainfall.

Hotter and longer summers are shifting wedding season

ClimateWire News - Wed, 09/03/2025 - 6:08am
Many couples who pick the summer to get hitched are choosing venues better able to handle the heat or shifting to earlier or later dates.

3 Questions: The pros and cons of synthetic data in AI

MIT Latest News - Wed, 09/03/2025 - 12:00am

Synthetic data are artificially generated by algorithms to mimic the statistical properties of actual data, without containing any information from real-world sources. While concrete numbers are hard to pin down, some estimates suggest that more than 60 percent of data used for AI applications in 2024 was synthetic, and this figure is expected to grow across industries.

Because synthetic data don’t contain real-world information, they hold the promise of safeguarding privacy while reducing the cost and increasing the speed at which new AI models are developed. But using synthetic data requires careful evaluation, planning, and checks and balances to prevent loss of performance when AI models are deployed.       

To unpack some pros and cons of using synthetic data, MIT News spoke with Kalyan Veeramachaneni, a principal research scientist in the Laboratory for Information and Decision Systems and co-founder of DataCebo, whose open-core platform, the Synthetic Data Vault, helps users generate and test synthetic data.

Q: How are synthetic data created?

A: Synthetic data are algorithmically generated but do not come from a real situation. Their value lies in their statistical similarity to real data. If we’re talking about language, for instance, synthetic data look very much as if a human had written those sentences. While researchers have created synthetic data for a long time, what has changed in the past few years is our ability to build generative models out of data and use them to create realistic synthetic data. We can take a little bit of real data and build a generative model from that, which we can use to create as much synthetic data as we want. Plus, the model creates synthetic data in a way that captures all the underlying rules and infinite patterns that exist in the real data.

There are essentially four different data modalities: language, video or images, audio, and tabular data. All four of them have slightly different ways of building the generative models to create synthetic data. An LLM, for instance, is nothing but a generative model from which you are sampling synthetic data when you ask it a question.      

A lot of language and image data are publicly available on the internet. But tabular data, which is the data collected when we interact with physical and social systems, is often locked up behind enterprise firewalls. Much of it is sensitive or private, such as customer transactions stored by a bank. For this type of data, platforms like the Synthetic Data Vault provide software that can be used to build generative models. Those models then create synthetic data that preserve customer privacy and can be shared more widely.      

One powerful thing about this generative modeling approach for synthesizing data is that enterprises can now build a customized, local model for their own data. Generative AI automates what used to be a manual process.
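The workflow Veeramachaneni describes, fitting a generative model to a little real data and then sampling as much synthetic data as needed, can be sketched in miniature. The toy below models each numeric column independently with a Gaussian, which is far simpler than SDV's actual copula-based models; the table, column names, and numbers are invented for the example:

```python
# Minimal sketch of the idea behind tabular synthetic data: fit a simple
# per-column model to real rows, then sample as many synthetic rows as
# you like. (Real systems such as the Synthetic Data Vault also capture
# correlations between columns; this toy version models each column
# independently and is purely illustrative.)
import random
import statistics

real_rows = [  # toy "real" table: (age, account_balance)
    (34, 1200.0), (45, 5600.0), (29, 800.0), (52, 7400.0), (41, 3100.0),
]

def fit(rows):
    """Learn a mean and standard deviation for each numeric column."""
    cols = list(zip(*rows))
    return [(statistics.mean(c), statistics.stdev(c)) for c in cols]

def sample(model, n, rng):
    """Draw n synthetic rows from the fitted per-column Gaussians."""
    return [tuple(rng.gauss(mu, sd) for mu, sd in model) for _ in range(n)]

model = fit(real_rows)
synthetic = sample(model, 1000, random.Random(0))

# The synthetic rows mimic the statistics of the real ones without
# copying any individual record, which is the privacy argument above.
print(len(synthetic))  # 1000
```

Note how a five-row "real" table yields an arbitrarily large synthetic one: once the model is fit, generation is cheap, which is what enables the testing and augmentation use cases discussed next.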

Q: What are some benefits of using synthetic data, and which use-cases and applications are they particularly well-suited for?

A: One fundamental application which has grown tremendously over the past decade is using synthetic data to test software applications. There is data-driven logic behind many software applications, so you need data to test that software and its functionality. In the past, people have resorted to manually generating data, but now we can use generative models to create as much data as we need.

Users can also create specific data for application testing. Say I work for an e-commerce company. I can generate synthetic data that mimics real customers who live in Ohio and made transactions pertaining to one particular product in February or March.

Because synthetic data aren’t drawn from real situations, they are also privacy-preserving. One of the biggest problems in software testing has been getting access to sensitive real data for testing software in non-production environments, due to privacy concerns. Another immediate benefit is in performance testing. You can create a billion transactions from a generative model and test how fast your system can process them.

Another application where synthetic data hold a lot of promise is in training machine-learning models. Sometimes, we want an AI model to help us predict an event that is less frequent. A bank may want to use an AI model to predict fraudulent transactions, but there may be too few real examples to train a model that can identify fraud accurately. Synthetic data provide data augmentation — additional data examples that are similar to the real data. These can significantly improve the accuracy of AI models.
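The fraud-detection augmentation idea can be illustrated with a deliberately simple sketch. Here the "generative model" is just small random perturbation of the few real minority-class examples (a SMOTE-like shortcut rather than a learned model); the feature names and numbers are invented:

```python
# Illustrative sketch of data augmentation for a rare class: generate
# extra synthetic fraud examples by perturbing the few real ones.
# Real pipelines would use a learned generative model instead; this
# jitter-based version only shows the shape of the workflow.
import random

rng = random.Random(42)

# Toy feature vectors: (transaction_amount, seconds_since_last_txn)
real_fraud = [(9800.0, 12.0), (7600.0, 30.0), (8900.0, 8.0)]  # rare class

def augment(examples, n, noise=0.05):
    """Create n synthetic examples by adding small relative Gaussian
    noise to randomly chosen real ones."""
    out = []
    for _ in range(n):
        base = rng.choice(examples)
        out.append(tuple(x * (1 + rng.gauss(0, noise)) for x in base))
    return out

synthetic_fraud = augment(real_fraud, 200)
training_set = real_fraud + synthetic_fraud  # 203 fraud examples, not 3
```

With 203 minority-class examples instead of 3, a downstream classifier has a far better chance of learning what fraud looks like, which is the augmentation benefit described above.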

Also, sometimes users don’t have time or the financial resources to collect all the data. For instance, collecting data about customer intent would require conducting many surveys. If you end up with limited data and then try to train a model, it won’t perform well. You can augment by adding synthetic data to train those models better.

Q: What are some of the risks or potential pitfalls of using synthetic data, and are there steps users can take to prevent or mitigate those problems?

A: One of the biggest questions people often have in their minds is: If the data are synthetically created, why should I trust them? Determining whether you can trust the data often comes down to evaluating the overall system where you are using them.

There are a lot of aspects of synthetic data we have been able to evaluate for a long time. For instance, there are existing methods to measure how close synthetic data are to real data, and we can measure their quality and whether they preserve privacy. But there are other important considerations if you are using those synthetic data to train a machine-learning model for a new use case. How would you know the data are going to lead to models that still make valid conclusions?

New efficacy metrics are emerging, and the emphasis is now on efficacy for a particular task. You must really dig into your workflow to ensure the synthetic data you add to the system still allow you to draw valid conclusions. That is something that must be done carefully on an application-by-application basis.
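One of the long-standing closeness checks mentioned above, comparing the distribution of a synthetic column against its real counterpart, can be sketched with a hand-rolled two-sample Kolmogorov-Smirnov statistic. The data here are simulated, and real evaluation suites (such as the group's Synthetic Data Metrics Library) offer much richer checks; this only shows the idea:

```python
# Minimal fidelity check between a real and a synthetic column: the
# two-sample Kolmogorov-Smirnov statistic, i.e. the largest gap between
# the two empirical CDFs. 0 means identical samples; values near 1 mean
# the distributions barely overlap.
import random
from bisect import bisect_right

def ks_statistic(a, b):
    """Max absolute difference between the empirical CDFs of a and b."""
    a, b = sorted(a), sorted(b)
    def cdf(xs, t):
        return bisect_right(xs, t) / len(xs)  # fraction of xs <= t
    return max(abs(cdf(a, t) - cdf(b, t)) for t in a + b)

rng = random.Random(1)
real     = [rng.gauss(50, 10) for _ in range(500)]
good_syn = [rng.gauss(50, 10) for _ in range(500)]  # same distribution
bad_syn  = [rng.gauss(80, 10) for _ in range(500)]  # wrong distribution

print(ks_statistic(real, good_syn) < ks_statistic(real, bad_syn))  # True
```

A low KS statistic says the marginals match; as the interview stresses, that alone does not guarantee a model trained on the synthetic data will draw valid conclusions, which is why task-specific efficacy metrics are needed on top.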

Bias can also be an issue. Since it is created from a small amount of real data, the same bias that exists in the real data can carry over into the synthetic data. Just like with real data, you would need to purposefully make sure the bias is removed through different sampling techniques, which can create balanced datasets. It takes some careful planning, but you can calibrate the data generation to prevent the proliferation of bias.

To help with the evaluation process, our group created the Synthetic Data Metrics Library. We worried that people would use synthetic data in their environment and it would give different conclusions in the real world. We created a metrics and evaluation library to ensure checks and balances. The machine learning community has faced a lot of challenges in ensuring models can generalize to new situations. The use of synthetic data adds a whole new dimension to that problem.

I expect that the old systems of working with data, whether to build software applications, answer analytical questions, or train models, will dramatically change as we get more sophisticated at building these generative models. A lot of things we have never been able to do before will now be possible.

Soft materials hold onto “memories” of their past, for longer than previously thought

MIT Latest News - Wed, 09/03/2025 - 12:00am

If your hand lotion is a bit runnier than usual coming out of the bottle, it might have something to do with the goop’s “mechanical memory.”

Soft gels and lotions are made by mixing ingredients until they form a stable and uniform substance. But even after a gel has set, it can hold onto “memories,” or residual stress, from the mixing process. Over time, the material can give in to these embedded stresses and slide back into its former, premixed state. Mechanical memory is, in part, why hand lotion separates and gets runny over time. 

Now, an MIT engineer has devised a simple way to measure the degree of residual stress in soft materials after they have been mixed, and found that common products like hair gel and shaving cream have longer mechanical memories, holding onto residual stresses for longer periods of time than manufacturers might have assumed.

In a study appearing today in Physical Review Letters, Crystal Owens, a postdoc in MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL), presents a new protocol for measuring residual stress in soft, gel-like materials, using a standard benchtop rheometer.

Applying this protocol to everyday soft materials, Owens found that if a gel is made by mixing it in one direction, then once it settles into a stable and uniform state, it effectively holds onto the memory of the direction in which it was mixed. Even after several days, the gel will hold some internal stress that, if released, will cause the gel to shift in the direction opposite to how it was initially mixed, reverting back to its earlier state.

“This is one reason different batches of cosmetics or food behave differently even if they underwent ‘identical’ manufacturing,” Owens says. “Understanding and measuring these hidden stresses during processing could help manufacturers design better products that last longer and perform more predictably.”

A soft glass

Hand lotion, hair gel, and shaving cream all fall under the category of “soft glassy materials” — materials that exhibit properties of both solids and liquids.

“Anything you can pour into your hand and it forms a soft mound is going to be considered a soft glass,” Owens explains. “In materials science, it’s considered a soft version of something that has the same amorphous structure as glass.”

In other words, a soft glassy material is a strange amalgam of a solid and a liquid. It can be poured out like a liquid, and it can hold its shape like a solid. Once they are made, these materials exist in a delicate balance between solid and liquid. And Owens wondered: For how long?

“What happens to these materials after very long times? Do they finally relax or do they never relax?” Owens says. “From a physics perspective, that’s a very interesting concept: What is the essential state of these materials?”

Twist and hold

In the manufacturing of soft glassy materials such as hair gel and shampoo, ingredients are first mixed into a uniform product. Quality control engineers then let a sample sit for about a minute — a period of time that they assume is enough to allow any residual stresses from the mixing process to dissipate. In that time, the material should settle into a steady, stable state, ready for use.

But Owens suspected that the materials may hold some degree of stress from the production process long after they’ve appeared to settle.

“Residual stress is a low level of stress that’s trapped inside a material after it’s come to a steady state,” Owens says. “This sort of stress has not been measured in these sorts of materials.”

To test her hypothesis, she carried out experiments with two common soft glassy materials: hair gel and shaving cream. She made measurements of each material in a rheometer — an instrument consisting of two rotating plates that can twist and press a material together at precisely controlled pressures and forces that relate directly to the material’s internal stresses and strains.

In her experiments, she placed each material in the rheometer and spun the instrument’s top plate around to mix the material. Then she let the material settle, and then settle some more — much longer than one minute. During this time, she observed the amount of force it took the rheometer to hold the material in place. She reasoned that the greater the rheometer’s force, the more it must be counteracting any stress within the material that would otherwise cause it to shift out of its current state.

Over multiple experiments using this new protocol, Owens found that different types of soft glassy materials held a significant amount of residual stress, long after most researchers would assume the stress had dissipated. What’s more, she found that the degree of stress that a material retained was a reflection of the direction in which it was initially mixed, and when it was mixed.

“The material can effectively ‘remember’ which direction it was mixed, and how long ago,” Owens says. “And it turns out they hold this memory of their past, a lot longer than we used to think.”

In addition to the protocol she has developed to measure residual stress, Owens has developed a model to estimate how a material will change over time, given the degree of residual stress that it holds. Using this model, she says scientists might design materials with “short-term memory,” or very little residual stress, such that they remain stable over longer periods.
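The article does not give the form of Owens's model, so as a purely hypothetical illustration of the general idea, the sketch below uses stretched-exponential stress relaxation, a decay law often applied to soft glassy materials; all parameters are invented:

```python
# Hypothetical illustration (not the model from the paper): residual
# stress that relaxes as a stretched exponential,
#   sigma(t) = sigma0 * exp(-(t/tau)**beta),
# where beta < 1 produces the long, slow tail that lets stress persist
# far beyond a naive "it relaxes in about a minute" estimate.
import math

def residual_stress(t, sigma0, tau, beta=0.5):
    """Stress remaining at time t; beta = 1.0 is ordinary exponential decay."""
    return sigma0 * math.exp(-(t / tau) ** beta)

sigma0, tau = 100.0, 60.0   # arbitrary stress units; 60 s nominal relaxation time
one_hour = 3600.0

slow = residual_stress(one_hour, sigma0, tau)            # stretched, beta = 0.5
fast = residual_stress(one_hour, sigma0, tau, beta=1.0)  # ordinary exponential

# An hour in, the ordinary-exponential material is essentially fully
# relaxed, while the stretched-exponential one still holds measurable stress.
print(slow > 1e6 * fast)  # True
```

A model of this general shape is one way to express the article's point that "short-term memory" materials (fast, complete relaxation) would behave more predictably than ones with a slow stress tail.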

One material where she sees room for such improvement is asphalt — a substance that is first mixed, then poured in molten form over a surface where it then cools and settles over time. She suspects that residual stresses from the mixing of asphalt may contribute to cracks forming in pavement over time. Reducing these stresses at the start of the process could lead to longer-lasting, more resilient roads.

“People are inventing new types of asphalt all the time to be more eco-friendly, and all of these will have different levels of residual stress that will need some control,” she says. “There’s plenty of room to explore.”

This research was supported, in part, by MIT’s Postdoctoral Fellowship for Engineering Excellence and an MIT Mathworks Fellowship.

Moving beyond projects to achieve transformative adaptation

Nature Climate Change - Wed, 09/03/2025 - 12:00am

Nature Climate Change, Published online: 03 September 2025; doi:10.1038/s41558-025-02414-x

Projects are not delivering the transformative change needed for climate change adaptation. This failure is due in part to the delivery of adaptation as projects, but there are viable alternatives that can better address the underlying and structural causes of vulnerability.

3 Questions: On biology and medicine’s “data revolution”

MIT Latest News - Tue, 09/02/2025 - 5:45pm

Caroline Uhler is the Andrew (1956) and Erna Viterbi Professor of Engineering at MIT; a professor of electrical engineering and computer science in the Institute for Data, Systems, and Society (IDSS); and director of the Eric and Wendy Schmidt Center at the Broad Institute of MIT and Harvard, where she is also a core institute member and a member of the scientific leadership team.

Uhler is interested in all the methods by which scientists can uncover causality in biological systems, ranging from causal discovery on observed variables to causal feature learning and representation learning. In this interview, she discusses machine learning in biology, areas that are ripe for problem-solving, and cutting-edge research coming out of the Schmidt Center.

Q: The Eric and Wendy Schmidt Center has four distinct areas of focus structured around four natural levels of biological organization: proteins, cells, tissues, and organisms. What, within the current landscape of machine learning, makes now the right time to work on these specific problem classes?

A: Biology and medicine are currently undergoing a “data revolution.” The availability of large-scale, diverse datasets — ranging from genomics and multi-omics to high-resolution imaging and electronic health records — makes this an opportune time. Inexpensive and accurate DNA sequencing is a reality, advanced molecular imaging has become routine, and single-cell genomics is allowing the profiling of millions of cells. These innovations — and the massive datasets they produce — have brought us to the threshold of a new era in biology, one where we will be able to move beyond characterizing the units of life (such as all proteins, genes, and cell types) to understanding the “programs of life”: the logic of gene circuits and cell-cell communication that underlies tissue patterning, and the molecular mechanisms that underlie the genotype-phenotype map.

At the same time, in the past decade, machine learning has seen remarkable progress with models like BERT, GPT-3, and ChatGPT demonstrating advanced capabilities in text understanding and generation, while vision transformers and multimodal models like CLIP have achieved human-level performance in image-related tasks. These breakthroughs provide powerful architectural blueprints and training strategies that can be adapted to biological data. For instance, transformers can model genomic sequences similar to language, and vision models can analyze medical and microscopy images.

Importantly, biology is poised to be not just a beneficiary of machine learning, but also a significant source of inspiration for new ML research. Much like agriculture and breeding spurred modern statistics, biology has the potential to inspire new and perhaps even more profound avenues of ML research. Unlike fields such as recommender systems and internet advertising, where there are no natural laws to discover and predictive accuracy is the ultimate measure of value, in biology, phenomena are physically interpretable, and causal mechanisms are the ultimate goal. Additionally, biology boasts genetic and chemical tools that enable perturbational screens on an unparalleled scale compared to other fields. These combined features make biology uniquely suited to both benefit greatly from ML and serve as a profound wellspring of inspiration for it.

Q: Taking a somewhat different tack, what problems in biology are still really resistant to our current tool set? Are there areas, perhaps specific challenges in disease or in wellness, which you feel are ripe for problem-solving?

A: Machine learning has demonstrated remarkable success in predictive tasks across domains such as image classification, natural language processing, and clinical risk modeling. However, in the biological sciences, predictive accuracy is often insufficient. The fundamental questions in these fields are inherently causal: How does a perturbation to a specific gene or pathway affect downstream cellular processes? What is the mechanism by which an intervention leads to a phenotypic change? Traditional machine learning models, which are primarily optimized for capturing statistical associations in observational data, often fail to answer such interventional queries. This gap creates a strong need for biology and medicine to also inspire new foundational developments in machine learning.
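As an illustrative sketch (not from the interview, and with all variable names hypothetical), a toy structural causal model shows why observational association can diverge from the answer to an interventional query: a hidden confounder can make a gene's measured expression strongly correlated with a phenotype even when experimentally setting that gene has no effect at all.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Toy structural causal model: a hidden confounder H drives both
# gene A's expression and phenotype Y; A has no causal effect on Y.
H = rng.normal(size=n)
A = H + rng.normal(scale=0.5, size=n)        # observed "gene expression"
Y = 2.0 * H + rng.normal(scale=0.5, size=n)  # observed phenotype

# Observational association: A and Y are strongly correlated...
obs_corr = np.corrcoef(A, Y)[0, 1]

# ...but under the intervention do(A := a), A is set by the experimenter,
# independently of H, so varying A leaves Y untouched.
A_do = rng.normal(size=n)                    # A set externally
Y_do = 2.0 * H + rng.normal(scale=0.5, size=n)
int_corr = np.corrcoef(A_do, Y_do)[0, 1]

print(obs_corr)  # high: confounded association
print(int_corr)  # near zero: no causal effect
```

A model trained purely on the observational pairs (A, Y) would confidently predict Y from A, yet give the wrong answer about what happens when A is perturbed — which is exactly the kind of interventional question perturbation experiments in biology pose.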

The field is now equipped with high-throughput perturbation technologies — such as pooled CRISPR screens, single-cell transcriptomics, and spatial profiling — that generate rich datasets under systematic interventions. These data modalities naturally call for the development of models that go beyond pattern recognition to support causal inference, active experimental design, and representation learning in settings with complex, structured latent variables. From a mathematical perspective, this requires tackling core questions of identifiability, sample efficiency, and the integration of combinatorial, geometric, and probabilistic tools. I believe that addressing these challenges will not only unlock new insights into the mechanisms of cellular systems, but also push the theoretical boundaries of machine learning.

With respect to foundation models, a consensus in the field is that we are still far from creating a holistic foundation model for biology across scales, similar to what ChatGPT represents in the language domain — a sort of digital organism capable of simulating all biological phenomena. While new foundation models emerge almost weekly, these models have thus far been specialized for a specific scale and question, and focus on one or a few modalities.

Significant progress has been made in predicting protein structures from their sequences. This success has highlighted the importance of iterative machine learning challenges, such as CASP (Critical Assessment of Structure Prediction), which have been instrumental in benchmarking state-of-the-art algorithms for protein structure prediction and driving their improvement.

The Schmidt Center is organizing challenges to increase awareness in the ML field and make progress in the development of methods to solve causal prediction problems that are so critical for the biomedical sciences. With the increasing availability of single-gene perturbation data at the single-cell level, I believe predicting the effect of single or combinatorial perturbations, and which perturbations could drive a desired phenotype, are solvable problems. With our Cell Perturbation Prediction Challenge (CPPC), we aim to provide the means to objectively test and benchmark algorithms for predicting the effect of new perturbations.

Another area where the field has made remarkable strides is disease diagnostics and patient triage. Machine learning algorithms can integrate different sources of patient information (data modalities), generate missing modalities, identify patterns that may be difficult for us to detect, and help stratify patients by disease risk. While we must remain cautious about potential biases in model predictions, the danger of models learning shortcuts rather than genuine signal, and the risk of automation bias in clinical decision-making, I believe this is an area where machine learning is already having a significant impact.

Q: Let’s talk about some of the headlines coming out of the Schmidt Center recently. What current research do you think people should be particularly excited about, and why? 

A: In collaboration with Dr. Fei Chen at the Broad Institute, we recently developed PUPS, a method for predicting the subcellular localization of unseen proteins. Many existing methods can only make predictions for the specific proteins and cell data on which they were trained. PUPS, however, combines a protein language model with an image in-painting model to use both protein sequences and cellular images. We demonstrate that the protein sequence input enables generalization to unseen proteins, while the cellular image input captures single-cell variability, enabling cell-type-specific predictions. The model learns how relevant each amino acid residue is to the predicted subcellular localization, and it can predict changes in localization caused by mutations in the protein sequence. Since a protein's function is closely tied to its subcellular localization, our predictions could provide insight into potential disease mechanisms. In the future, we aim to extend the method to predict the localization of multiple proteins in a cell and possibly to understand protein-protein interactions.

Together with Professor G.V. Shivashankar, a long-time collaborator at ETH Zürich, we previously showed that simple images of cells whose chromatin is labeled with fluorescent DNA-intercalating dyes can, when combined with machine learning algorithms, yield a wealth of information about the state and fate of a cell in health and disease. Recently, we built on this observation and demonstrated the deep link between chromatin organization and gene regulation by developing Image2Reg, a method that predicts unseen genetically or chemically perturbed genes from chromatin images. Image2Reg uses convolutional neural networks to learn an informative representation of the chromatin images of perturbed cells. It also employs a graph convolutional network to create a gene embedding that captures the regulatory effects of genes based on protein-protein interaction data, integrated with cell-type-specific transcriptomic data. Finally, it learns a map between the resulting physical and biochemical representations of cells, allowing us to predict the perturbed gene modules from chromatin images.

We also recently finalized the development of MORPH, a method for predicting the outcomes of unseen combinatorial gene perturbations and identifying the types of interactions occurring between the perturbed genes. MORPH can guide the design of the most informative perturbations for lab-in-a-loop experiments, and its attention-based framework provably enables the method to identify causal relations among genes, providing insight into the underlying gene regulatory programs. Finally, thanks to its modular structure, MORPH can be applied to perturbation data measured in various modalities, including not only transcriptomics but also imaging. We are very excited about this method's potential to enable efficient exploration of the perturbation space and advance our understanding of cellular programs by bridging causal theory to important applications, with implications for both basic research and therapeutics.
