
WiDS Cambridge
Poster Session 2025

Poster abstracts were peer-reviewed and 21 submissions were selected for the session. WiDS Cambridge is fortunate to have such a wide range of highly qualified students, postdocs and researchers interested in participating in the poster session each year. All presenters are students, postdocs and early career research scientists, and will give a live Lightning Talk at the conference.

Poster Presentations 2025


Presenter: Jasmine Jerry Aloor

Affiliation: MIT

Title: Cooperation and Fairness in Multi-Agent Reinforcement Learning 

Authors: Jasmine Jerry Aloor, Siddharth Nayak, Sydney Dolan, Hamsa Balakrishnan

Abstract: Multi-agent systems are trained to maximize shared cost objectives, which typically reflect system-level efficiency. However, in the resource-constrained environments of mobility and transportation systems, efficiency may be achieved at the expense of fairness — certain agents may incur significantly greater costs or lower rewards compared to others. Tasks could be distributed inequitably, leading to some agents receiving an unfair advantage while others incur disproportionately high costs. It is, therefore, important to consider the tradeoffs between efficiency and fairness in such settings. We consider the problem of fair multi-agent navigation for a group of decentralized agents using multi-agent reinforcement learning. We consider the reciprocal of the coefficient of variation of the distances traveled by different agents as a measure of fairness and investigate whether agents can learn to be fair without significantly sacrificing efficiency (i.e., increasing the total distance traveled). We find that by training agents using min-max fair distance goal assignments along with a reward term that incentivizes fairness as they move towards their goals, the agents (1) learn a fair assignment of goals and (2) achieve almost perfect goal coverage in navigation scenarios using only local observations. For goal coverage scenarios, we find that, on average, the proposed model yields a 14% improvement in efficiency and a 5% improvement in fairness over a baseline model that is trained using random assignments.
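The fairness measure named in the abstract above, the reciprocal of the coefficient of variation of the distances traveled, can be sketched in a few lines. This is an illustrative sketch, not the authors' code: the function name `fairness` is hypothetical, and both the use of the population standard deviation and the handling of the zero-variance case are assumptions.

```python
import math
import statistics


def fairness(distances):
    """Reciprocal of the coefficient of variation: mean / standard deviation.

    Assumption: population standard deviation; the abstract does not say
    which variant is used.
    """
    mean = statistics.fmean(distances)
    std = statistics.pstdev(distances)
    if std == 0:
        # Assumption: identical distances count as a perfectly even split.
        return math.inf
    return mean / std


# Hypothetical distances traveled by three agents (same mean in both cases).
even = fairness([10.0, 11.0, 9.0])    # similar distances
uneven = fairness([2.0, 10.0, 18.0])  # very unequal distances
print(even > uneven)
```

A higher score means the distances are spread more evenly across agents, so the measure can be tracked alongside total distance traveled to see the efficiency/fairness tradeoff.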

 

Presenter: Ching Lam Choi

Affiliation: MIT

Title: Fairness Aware Preference Optimization

Authors: Ching Lam Choi, Vighnesh Subramaniam, Mohammad-Amin Charusaie, Antonio Torralba, Phillip Isola, Stefanie Jegelka

Abstract: Reward models for LLM alignment are fit to human preferences; current methods treat preference data as the gold standard for fairness-aligning LLMs, through different pre- and post-processing techniques. However, human annotators may be demographically under- or over-represented. They could further be predisposed by attributes such as race, gender or age to disagree on controversial prompts. Due to human bias, selection bias and backdoor confounders, present approaches suffer from the non-identifiability of a Pareto-efficient (fair and optimal) reward model, resulting in incomplete, suboptimal alignment policies that only partially observe algorithmic fairness. To mitigate trickle-down problems of unfairness (from data to reward model to RL policy to the finetuned LLM), we propose an in-processing approach to design a fair reward model, which in turn provides fair supervision to the LLM during RL finetuning. Concretely, we introduce "fairness aware reward optimisation" (FARO) into the reward modeling phase, constraining the reward to be independent of sensitive demographic attributes conditional on unrestricted features. Theoretically, we analyse the tradeoff between error and fairness for different fairness paradigms and demonstrate the Pareto efficiency of FARO. Experiments on BBQ, PRISM, HolisticBias and other toxicity-measuring datasets underscore FARO's impact on bias and toxicity reduction, whilst preserving LLM quality, factuality and ability to model both ordinal and cardinal human preferences.

 

Presenter: Chiara Fusar Bassini

Affiliation: MIT

Title: Explainable Unsupervised Detection of Market Power-related Anomalies in ISO New England Electricity Market

Authors: Chiara Fusar Bassini, Lynn H. Kaack, Priya L. Donti

Abstract: During the 2000–2001 California electricity crisis, generation companies artificially reduced the electricity supply, causing blackouts and price spikes. Nowadays, many power system operators constantly monitor market power. Often they use structural indices, which are economic estimates based on costs, supply and demand. While these indices are easy to implement and interpret, their shortcomings are becoming more evident as systems increase in complexity. For instance, a Structural Pivotality Test, which checks whether a company's generation capacity is needed to meet demand, would fail when generators act in coordination. In well-connected and competitive markets, we expect market power abuse to affect only a few hours, which will deviate from typical observations. Therefore, one could use anomaly detection to identify such hours directly from the data. We compare two anomaly detection models (variational autoencoder and isolation forest) on ISO New England supply bid data against the Structural Pivotality Test currently in use. Using ex-post local explanations (SHAP values), we reveal the key features that cause an observation to be an outlier.
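The pivotality check described in the abstract above, and the way it can miss coordinated behaviour, can be illustrated with a small sketch. It assumes the simplest reading of the test (a supplier is pivotal when the remaining capacity cannot cover demand); the function name `is_pivotal`, the capacities and the demand figure are all hypothetical and are not taken from ISO New England data.

```python
def is_pivotal(capacities, demand, group):
    """True if demand cannot be met once the group's capacity is withheld."""
    remaining = sum(cap for name, cap in capacities.items() if name not in group)
    return remaining < demand


# Hypothetical generation capacities (MW) and demand for one hour.
capacities = {"A": 40, "B": 40, "C": 40}
demand = 70

# Company-by-company test: each supplier is checked on its own.
single = {name: is_pivotal(capacities, demand, {name}) for name in capacities}

# Same test applied to two companies acting together.
joint = is_pivotal(capacities, demand, {"A", "B"})

print(single, joint)
```

In this toy case no single company is pivotal, yet A and B together are, which is the kind of coordinated action a per-company structural test would not flag.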

 

Presenter: Jessy Xinyi Han

Affiliation: MIT

Title: Causal Frameworks for Decision Making in Criminal Justice and Healthcare

Author: Jessy Xinyi Han

Abstract: Ensuring fair decision-making is a critical challenge in high-stakes fields like criminal justice and healthcare. Traditional statistical methods often fail to distinguish between correlation and causation, leading to biased conclusions. To address this, we develop causal frameworks and data-driven methods to investigate disparities in these domains. In criminal justice, we introduce multi-stage causal models that systematically evaluate racial disparities and identify their potential sources across various stages, including police-civilian interactions and recidivism. Our empirical analysis of police stop and 911 call data in New Orleans reveals a counter-intuitive phenomenon that observational bias against the majority race is driven by the disproportionate over-reporting of minorities. Extending our analysis to recidivism, we apply survival analysis to examine the impact of socio-economic factors on recidivism across racial groups. While short-term recidivism rates appear similar when controlling for risk scores, significant long-term disparities emerge, suggesting unequal access to long-term support structures exacerbates these differences beyond algorithmic bias. We extend the use of survival analysis to healthcare, where we develop a prognostic model for relapsed/refractory T-cell lymphoma, enabling clinically meaningful risk stratification and guiding more precise treatments. Our findings highlight the necessity of causal inference for improving fair decision-making in both criminal justice and healthcare.


Presenter: Jenny Huang

Affiliation: MIT

Title: Approximations to Worst-case Data Dropping: Unmasking Failure Modes

Authors: Jenny Y. Huang, David R. Burt, Tin D. Nguyen, Yunyi Shen, Tamara Broderick

Abstract: A data analyst might worry about generalization if dropping a very small fraction of data points from a study could change its substantive conclusions. Finding the worst-case data subset to drop poses a combinatorial optimization problem. To overcome this intractability, recent works propose using additive approximations, which treat the contribution of a collection of data points as the sum of their individual contributions, and greedy approximations, which iteratively select the point with the highest impact to drop and re-run the data analysis without that point [Broderick et al., 2020, Kuschnig et al., 2021]. We identify that, even in a setting as simple as OLS linear regression, many of these approximations can break down in realistic data arrangements. Several of our examples reflect masking, where one outlier may hide or conceal the effect of another outlier. Based on the failures we identify, we provide recommendations for users and suggest directions for future improvements.
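The two families of approximation described in the abstract above can be contrasted on a toy simple-regression problem. This is a simplified sketch, not the cited methods: it assumes each point's individual contribution is measured by an exact leave-one-out refit (published approaches may use other approximations), it assumes the goal is to push the slope downward, and all function names and data are hypothetical.

```python
def ols_slope(xs, ys):
    """Slope of a simple least-squares line through (xs, ys)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx


def loo_effects(xs, ys):
    """Change in slope caused by dropping each point on its own."""
    base = ols_slope(xs, ys)
    return [
        ols_slope(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:]) - base
        for i in range(len(xs))
    ]


def additive_drop(xs, ys, k):
    """Additive approximation: sum the k most negative single-point effects."""
    effects = loo_effects(xs, ys)
    idx = sorted(range(len(xs)), key=lambda i: effects[i])[:k]
    predicted = ols_slope(xs, ys) + sum(effects[i] for i in idx)
    return sorted(idx), predicted


def greedy_drop(xs, ys, k):
    """Greedy approximation: drop the most influential point, refit, repeat."""
    xs, ys = list(xs), list(ys)
    remaining = list(range(len(xs)))  # original indices still in the data
    dropped = []
    for _ in range(k):
        effects = loo_effects(xs, ys)
        j = min(range(len(xs)), key=lambda i: effects[i])
        dropped.append(remaining[j])
        del xs[j], ys[j], remaining[j]
    return sorted(dropped), ols_slope(xs, ys)


# Hypothetical data: a rising trend plus two points far to the right.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 10.0, 11.0]
ys = [0.0, 1.1, 1.9, 3.2, 3.9, -2.0, -3.0]

print(additive_drop(xs, ys, 2))
print(greedy_drop(xs, ys, 2))
```

For a single dropped point the two approaches agree by construction; they can diverge for larger subsets because the additive version never refits, which is where effects such as masking can go unnoticed.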

 

Presenter: Shreya Johri

Affiliation: Harvard University

Title: An Evaluation Framework for Clinical Use of Large Language Models

Authors: Shreya Johri, Jaehwan Jeong, Benjamin Tran, Daniel Schlessinger, Shannon Wongvibulsin, Leandra Barnes, Hong-Yu Zhou, Zhuo Ran Cai, Eliezer Van Allen, David Kim, Roxana Daneshjou, Pranav Rajpurkar

Abstract: We introduce the Conversational Reasoning Assessment Framework for Testing in Medicine (CRAFT-MD) approach for evaluating clinical LLMs. Unlike traditional methods that rely on structured medical examinations, CRAFT-MD focuses on natural dialogues, using simulated artificial intelligence agents to interact with LLMs in a controlled environment. We applied CRAFT-MD to assess the diagnostic capabilities of 10 recent LLMs across 12 medical specialties. Our experiments revealed critical insights into the limitations of current LLMs in terms of clinical conversational reasoning, history-taking and diagnostic accuracy. These limitations also persisted when analyzing multimodal conversational and visual assessment capabilities. We propose a comprehensive set of recommendations for future evaluations of clinical LLMs based on our empirical findings. These recommendations emphasize realistic doctor–patient conversations, comprehensive history-taking, open-ended questioning and using a combination of automated and expert evaluations. The introduction of CRAFT-MD marks an advancement in testing of clinical LLMs, aiming to ensure that these models augment medical practice effectively and ethically.

 

Presenter: Eirini Katsidoniotaki

Affiliation: MIT

Title: Reduced Order Modeling of Marine Energy Systems via Sequential Bayesian Experimental Design and Machine Learning

Authors: Eirini Katsidoniotaki, Themistoklis Sapsis

Abstract: Marine energy technologies face significant challenges in ensuring their survivability under extreme ocean conditions. Quantifying extreme load statistics on marine energy structures is essential for reliable structural design; however, this is a challenging task due to the scarcity of high-quality data and the inherent uncertainties associated with predicting rare events. While computational fluid dynamics (CFD) simulations can accurately capture the nonlinear dynamics and loads in extreme wave–structure interactions, providing high-fidelity data, extracting statistical information through these models is computationally impractical. This poster shows a reduced-order modeling framework for marine energy systems, enabling efficient analysis across diverse scenarios, and facilitating the quantification of extreme load statistics with significantly reduced computational cost. Specifically, a hybrid reduced-order or surrogate model for a wave energy converter is developed to map extreme sea states and design parameters to the resulting loads in the mooring system. The term “hybrid” refers to the combination of Gaussian Process Regression (GPR) and Long Short-Term Memory (LSTM) neural networks. The model is developed using an active learning approach that strategically selects the most informative CFD samples from regions of the input space associated with extreme mooring loads. This procedure iteratively refines the model while minimizing prediction uncertainty, making it particularly effective for real-world applications where obtaining each sample requires substantial time and resources. The developed model demonstrates its exceptional ability to efficiently predict complex load time series, including instantaneous peaks, at speeds significantly faster than traditional modeling methods. Subsequently, the model is utilized to effectively evaluate Monte Carlo samples, providing accurate estimates of the probability of extreme mooring loads. Understanding the expected extreme loads is essential during the design phase of marine energy systems, enabling cost reduction by optimizing strength margins, refining overly conservative safety factors, and enhancing overall system reliability.

 

Presenter: Dahye Kim

Affiliation: Boston University

Title: Concept Steerers: Leveraging K-Sparse Autoencoders for Controllable Generations

Authors: Dahye Kim, Deepti Ghadiyaram

Abstract: Despite the remarkable progress in text-to-image generative models, they are prone to adversarial attacks and inadvertently generate unsafe, unethical content. Existing approaches often rely on fine-tuning models to remove specific concepts, which is computationally expensive, lacks scalability, and/or compromises generation quality. In this work, we propose a novel framework leveraging k-sparse autoencoders (k-SAEs) to enable efficient and interpretable concept manipulation in diffusion models. Specifically, we first identify interpretable monosemantic concepts in the latent space of text embeddings and leverage them to precisely steer the generation away from or towards a given concept (e.g., nudity) or to introduce a new concept (e.g., photographic style). Through extensive experiments, we demonstrate that our approach is very simple, requires neither retraining of the base model nor LoRA adapters, does not compromise the generation quality, and is robust to adversarial prompt manipulations. Our method yields an improvement of 20.01% in unsafe concept removal, is effective in style manipulation, and is ∼5x faster than current state-of-the-art.


Presenter: Samantha Klasfeld

Affiliation: Pfizer

Title: Coding cis pQTLs from Proteogenomic Data Allow Evaluation of the Performance of Missense Variant Effect Predictions and the Utility of their Application to Rare Variant Association Analyses

Authors: Samantha J. Klasfeld, Eric B. Fauman, Melissa R. Miller, Hye In Kim

Abstract: Large-scale sequencing studies have identified numerous rare protein-altering coding variants, but their functional effects remain largely unknown. This study evaluates the performance of new variant effect prediction methods such as AlphaMissense and ESM1b, which utilize unsupervised machine learning models based on sequence conservation, context, and protein structure. Using coding cis pQTLs as a benchmark, we tested the association between 2,941 plasma protein levels and cis coding variants in 32,325 individuals, identifying 3,484 significant coding cis pQTLs (p-value < 1.2e-7). These were used to benchmark prediction scores from dbNSFP, ESM1b, and AlphaMissense. Missense variants with higher prediction scores for being deleterious were enriched for coding cis pQTLs and had stronger effects on protein levels. The correlations between prediction scores and effect sizes ranged from 0.2 to 0.3. Using missense prediction scores to guide variant inclusion improved rare variant association analyses, identifying unique deleterious variants and improving statistical power. Future work will involve evaluating prediction methods using mutagenesis datasets and incorporating prediction scores as continuous weights in burden analyses to enhance statistical power.

 

Presenter: Ching-Yun Ko

Affiliation: IBM 

Title: Can Memory-Augmented Language Models Generalize on Reasoning-in-a-Haystack Tasks?

Authors: Payel Das, Ching-Yun Ko, Sihui Dai, Georgios Kollias, Subhajit Chaudhury, Aurelie Lozano

Abstract: Large language models often expose their brittleness in reasoning tasks, especially while executing long chains of reasoning over context. We propose MemReasoner, a new and simple memory-augmented LLM architecture, in which the memory learns the relative order of facts in context, and enables hopping over them, while the decoder selectively attends to the memory. MemReasoner is trained end-to-end, with optional supporting fact supervision of varying degrees. We train MemReasoner, along with existing memory-augmented transformer models and a state-space model, on two distinct synthetic multi-hop reasoning tasks. Experiments performed under a variety of challenging scenarios, including the presence of long distractor text or target answer changes in the test set, show strong generalization of MemReasoner on both single- and two-hop tasks. This generalization of MemReasoner is achieved using none-to-weak supporting fact supervision (using none and 1% of supporting facts for one- and two-hop tasks, respectively). In contrast, baseline models overall struggle to generalize and benefit far less from using full supporting fact supervision. The results highlight the importance of explicit memory mechanisms, combined with additional weak supervision, for improving large language models' context-processing ability toward reasoning tasks.

 

Presenter: Sukanya Krishna

Affiliation: Harvard University

Title: Advanced Machine Learning for Substance Overdose-Related Mortality Prediction

Authors: Sukanya Krishna, Marie-Laure Charpignon, Maimuna Majumder

Abstract: In 2023, substance overdoses claimed over 81,000 lives in the U.S., highlighting the need for more accurate predictive models to inform public health interventions. Traditional statistical methods like Seasonal Autoregressive Integrated Moving Average (SARIMA) struggle with nonlinear trends and policy interventions. This study explores deep learning (DL) models—Long Short-Term Memory (LSTM) networks and Temporal Fusion Transformer (TFT) models—to improve overdose mortality predictions. Using CDC WONDER mortality data, we train and evaluate LSTM and TFT models against SARIMA, assessing accuracy via mean absolute percentage error (MAPE) and precision via prediction interval (PI) length. Preliminary results show that LSTM outperforms SARIMA in validation (MAPE: LSTM 2.99% vs. SARIMA 4.00%) and testing (LSTM 15.24% vs. SARIMA 16.23%), suggesting improved predictive power. TFT models will integrate socioeconomic and behavioral risk factors to capture complex interactions and enhance interpretability. Future work includes uncertainty estimation using conformal prediction and Monte Carlo dropout to improve reliability. Our findings suggest that DL models may outperform SARIMA in forecasting overdose mortality, providing actionable insights into epidemic trends across substances and regions. This work contributes to refining resource allocation and intervention strategies, leveraging machine learning for data-driven public health responses.
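The two evaluation quantities named in the abstract above, mean absolute percentage error (MAPE) for accuracy and prediction interval (PI) length for precision, can be written out as follows. This is a minimal sketch using the standard definitions; the function names and the numbers are hypothetical and are not CDC WONDER data or the study's results.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent.

    Assumes no actual value is zero.
    """
    errors = [abs(a - p) / abs(a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(errors) / len(errors)


def mean_interval_length(lower, upper):
    """Average width of the prediction intervals; narrower is more precise."""
    return sum(u - l for l, u in zip(lower, upper)) / len(lower)


# Hypothetical monthly counts, point forecasts, and interval bounds.
actual = [100.0, 200.0]
predicted = [110.0, 180.0]
lower = [90.0, 170.0]
upper = [120.0, 210.0]

print(mape(actual, predicted))
print(mean_interval_length(lower, upper))
```

Reporting both together matters because a model can score a low MAPE while producing intervals too wide to be useful, or the reverse.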

 

Presenter: Jessica Quaye

Affiliation: Harvard University

Title: From seed to harvest: Augmenting human creativity with AI for efficient red-teaming of text-to-image generative models

Authors: Jessica Quaye, Alicia Parrish, Oana Inel, Minsuk Kahng, Charvi Rastogi, Vijay Janapa Reddi, Lora Ayoro

Abstract: Limited datasets of adversarial prompts and corresponding images exist for evaluating text-to-image (T2I) model safety. Current techniques for generating these prompts are either purely human-driven or automated. Human-generated datasets are often small and sometimes imbalanced, while automatically-generated datasets, despite their scalability, often lack diversity and realistic human elements encountered in practice. To address this gap, I combine the strength of both human and automated approaches to develop a hybrid red-teaming technique that creates an augmented dataset from human-written implicitly adversarial prompts. This augmented dataset consists of realistic and semantically similar prompts, generated in a constrained yet scalable manner. It allows for scaling up the dataset through a series of traceable and understandable steps. My technique attains a higher success rate than purely human-generated prompts, while preserving the realistic nature of these prompts and replicating their ability to identify real-world harms. This work highlights the importance of human-machine collaboration to leverage human creativity in scalable red-teaming techniques to continuously enhance T2I model safety. 

 

Presenter: Shreyaa Raghavan

Affiliation: MIT

Title: Quantifying Controllable Congestion with Traffic Flow Optimization

Authors: Shreyaa Raghavan, Edgar Ramirez Sanchez, Cathy Wu

Abstract: To minimize the negative costs of traffic congestion (emissions, accidents, longer travel times, etc.), a prominent line of work has emerged on developing intelligent transportation systems (ITS) and traffic control, such as autonomous vehicles or variable speed limits, to coordinate and optimize the flow of traffic. However, the extent to which traffic congestion can be mitigated via ITS is unknown. We propose an alternative categorization of congestion that measures the extent to which it is controllable through operational means of traffic control. In this work, we motivate the need for quantifying controllable congestion, show how to solve for controllable congestion using a second-order traffic simulation known as METANET, and optimize target speeds on a large-scale road network with model predictive control (MPC). These target speeds could, in practice, be actuated via autonomous vehicles, mobile app notifications, or variable speed limit signs. Our results show that this method reduces travel times on congested highways and provides a better approximation of controllable congestion compared to existing metrics of congestion, such as delay. This work ultimately aims to allow public and private sectors to more effectively devote their resources to ITS technologies.

 

Presenter: Maria Sol Rosito

Affiliation: Harvard T.H. Chan School of Public Health and Dana-Farber Cancer Institute

Title: Interpretable Machine Learning for Pedigree Data Deduplication in Cancer Genetics

Authors: Maria Sol Rosito, Aleck Cervantes, Christine Hong, Joseph Bonner, Stephen Gruber, and Danielle Braun

Abstract: Li-Fraumeni syndrome (LFS), caused by mutations in the TP53 gene, is associated with a significantly increased risk of early onset cancer. Accurate estimates of age-specific cancer risk for individuals with LFS are essential for guiding clinical decision-making, including treatment and screening strategies. However, duplicate records in family-based datasets, resulting from multiple clinical visits or inconsistent naming conventions, can bias these risk estimates. Despite the wide adoption of machine learning for data integrity, its application to pedigree data deduplication remains underexplored. To address this challenge, we propose two complementary, interpretable approaches for pedigree deduplication. First, we develop graph-based features that capture structural and node-level discrepancies within pedigrees and train a random forest classifier on a dataset of families affected with LFS and augmented with synthetic duplicates. Second, we introduce a heuristic partial labeling strategy that leverages genetic variant clustering to identify high-confidence duplicate records while maintaining a low false positive rate. By integrating these methods, we aim to establish a robust, interpretable workflow for enhancing pedigree data quality, leading to more accurate cancer risk estimates.

 

Presenter: Maryann Rui

Affiliation: MIT

Title: Learning Mixtures of Linear Systems

Authors: Maryann Rui, Munther Dahleh

Abstract: With time series data, such as healthcare, social science, and biological data, there is often a large number of systems (patients, groups, cells), but with a limited amount of data available per individual system. Without additional structure, it may be difficult to identify individual models for each observed system from the data. However, within these settings, mixture models allow for tractability and efficiency of learning multiple models from data. In this project, we study the problem of learning mixtures of linear dynamical systems (MLDS) from input-output data. This mixture setting allows us to leverage observations from related dynamical systems to improve the estimation of individual models. Building on spectral methods for mixtures of linear regressions, we propose a moment-based estimator that uses tensor decomposition to estimate the impulse response of component models of the mixture. The estimator improves upon existing tensor decomposition approaches for MLDS by utilizing the entire length of the observed trajectories. We provide sample complexity bounds for estimating MLDS in the presence of noise, in terms of both N (number of trajectories) and T (trajectory length), and demonstrate the performance of our estimator through simulations.

 

Presenter: Mridula (Mally) Shan

Affiliation: Harvard

Title: Real-Time Monitoring of Health Data Breaches with R-based Dashboard

Author: Mridula (Mally) Shan

Abstract: In almost every month of 2020, more than 1 million people were affected by data breaches at healthcare organizations; this was a 9851% increase from 2019, leading to security exploitations in 560 healthcare organizations. To help patients and providers monitor such attacks in real-time, I developed an interactive, statistical analysis dashboard using data published by the Department of Health and Human Services from 2009 to 2025 (available at: mal-shan.shinyapps.io/healthsecurity). The dashboard includes: an interactive map of the United States with population-level statistics on the number of affected patients; a reactive plot comparing trends in healthcare breaches over time; and interactive linear models comparing the prevalence of attacks. We see from these visualizations that there are differences in the median number of affected individuals (p<0.01) depending on who the attack is directed against. For example, a Tukey HSD test demonstrates that more data is breached in attacks against health plans (95% CI [ -0.357, -0.154]) or healthcare providers (95% CI [-0.293, -0.142]) compared to business associates or clearing houses. By determining such trends in health cyberattacks, this dashboard demonstrates where resources should be dedicated for the improvement of medical data security. 

 

Presenter: Yuwen Tan

Affiliation: Boston University

Title: Lifting Data for Foundation Model Unlearning

Author: Yuwen Tan

Abstract: Machine unlearning removes certain training data points and their influence on AI models (e.g. when a data owner revokes their decision to allow models to learn from the data). We propose to lift data-tracing machine unlearning to knowledge-tracing for foundation models (FMs). We support this position based on practical needs and insights from cognitive studies. Practically, tracing data cannot meet the diverse unlearning requests for FMs, which may come from regulators, enterprise users, product teams, etc., who have no access to FMs' massive training data. Instead, it is convenient for these parties to issue an unlearning request about the knowledge or capability FMs should (or should not) possess. Cognitively, knowledge-tracing unlearning aligns more closely with how the human brain forgets than tracing individual training data points does.

 

Presenter: Njesa Totty

Affiliation: Framingham State University

Title: Hyperparameter-Tuned Oversampling of Imbalanced Data with Overlapping Features

Authors: Njesa Totty, Claudio Fuentes, James Molyneux

Abstract: We present a novel and flexible oversampling technique for binary classification with an imbalanced response variable and overlap in the feature space. Imbalance and overlap are known to inhibit classifier performance (e.g. Li et al. 2021; Shahee and Ananthakumar 2021). Most solutions to this problem build on the popular and influential Synthetic Minority Oversampling TEchnique (SMOTE) (e.g. Chawla et al. 2002; Elreedy and Atiya 2019). Limitations of SMOTE include inflexibility in determining which minority examples to oversample and the treatment of categorical variables. Based on extensive simulations that characterized the impacts of imbalance and overlap under a variety of data difficulties and models, we present a novel algorithm that allows one to tune the hyperparameters of the oversampling process. This can lead to improved predictive performance of classification models, which we illustrate with an application to institutional research.
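For readers unfamiliar with SMOTE, the baseline that the abstract above builds on, its core interpolation step can be sketched as follows. This shows only the basic SMOTE idea for numeric features, not the authors' tunable algorithm; the function name `smote`, its parameters and the sample points are hypothetical.

```python
import math
import random


def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic minority points by interpolating between a
    randomly chosen minority point and one of its k nearest minority
    neighbours. Numeric features only; needs at least two minority points.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself).
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


# Hypothetical minority-class points in a two-dimensional feature space.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote(points, n_new=5)
print(new_points)
```

In this basic form every minority point is equally likely to be oversampled and `k` is the only knob, which reflects the inflexibility the abstract mentions; the interpolation also has no natural meaning for categorical variables.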


Presenter: Neslihan Yildiz-Ozhan

Affiliation: Brigham and Women's Hospital, Harvard Medical School

Title: Neighborhoods and Children’s Neurodevelopment: Unraveling the Link Between Environment, Brain Structure, and Cognition

Authors: Neslihan Yildiz-Ozhan, Benson S Ku, Ryan Zurrin, Lauren J. O’Donnell, Martha E. Shenton, Yogesh Rathi, Johanna Seitz-Holland, Suheyla Cetin-Karayumak

Abstract: Neighborhood opportunities shape children’s physical and mental well-being, yet their impact on brain and cognitive development remains underexplored. Neural pathways essential for cognition are particularly sensitive to environmental influences due to their prolonged maturation. This study employs a data-driven, multi-modal approach to investigate how neighborhood opportunities influence neurodevelopment using data from 6,141 children (ages 8-10) in the Adolescent Brain Cognitive Development Study, a large-scale dataset collected across 21 U.S. sites. Neighborhood opportunities, assessed via the Child Opportunity Index, capture disparities in education, socioeconomic resources, and environmental quality. A key challenge in multi-site neuroimaging studies is scanner-related variability. We addressed this using our well-established harmonization framework based on Rotation Invariant Spherical Harmonics to ensure cross-site comparability. Using Generalized Additive Models and mediation analyses, we found that children from higher-opportunity neighborhoods demonstrated stronger cognitive functions (including memory, language, and decision-making), with enhanced neural connectivity partially mediating this relationship. This study leverages large-scale multi-modal data, advanced neuroimaging methods, and data analytics to highlight how structural inequities shape neurodevelopment in children. Findings underscore the importance of public health initiatives targeting education, environmental quality, and socioeconomic equity to promote children’s cognitive and brain health.


Presenter: Arianna Zuanazzi

Affiliation: Child Mind Institute

Title: Data Science Engagement through Open Data

Author: Arianna Zuanazzi

Abstract: The Open Science movement promotes transparency, accessibility, reproducibility, and collaboration, making scientific research more sustainable and inclusive. A key component of this movement is Open Data—the practice of making data openly available, accessible, and reusable. Open Data reduces costs, minimizes redundancy, and fosters interdisciplinary collaborations by enabling researchers and stakeholders from diverse fields to develop innovative solutions. Here, I will introduce Open Data within the broader framework of Open Science, outlining its benefits and the essential steps for managing and sharing large datasets with researchers and diverse communities. I will then share our work collaborating with Women in Data Science (WiDS) to organize data science competitions focused on mental health, leveraging the open-access multimodal dataset from the Healthy Brain Network (HBN), a community-based research initiative of the Child Mind Institute. I will highlight how initiatives such as the WiDS “Unraveling the Mysteries of the Female Brain” Datathons expand community engagement, empower women worldwide to strengthen their data science skills, and raise awareness about mental health challenges faced by girls and women. 
