AI Makes Scientific Discoveries on Its Own: Introducing CodeScientist

April 2, 2025

A research team led by Peter Jansen has developed CodeScientist, an AI system designed to generate and test scientific hypotheses. The project brings together researchers from the Allen Institute for Artificial Intelligence, the University of Washington, the University of Arizona, and the Hebrew University of Jerusalem.

The system has identified 19 potential discoveries in AI and virtual environments research, with six meeting the criteria for scientific soundness and incremental novelty after peer review.

"We're working toward computer systems that can help identify gaps in scientific knowledge and design experiments to fill them," says Jansen. "CodeScientist is a step in that direction, though we still have much to learn."

How It Works

CodeScientist follows a five-step process:

  1. Ideation: The system reads research papers and code examples to generate research ideas. For this study, it analyzed 57 papers in agent architectures and virtual environments.

  2. Planning: It creates detailed experimental plans and identifies needed code components.

  3. Experiment Construction: The system writes and debugs code through an iterative process.

  4. Reporting: It generates reports on methodology and results.

  5. Meta-Analysis: Each experiment runs multiple times to verify results.

Technical Details

The research team used:

  • Claude 3.5 Sonnet (version 1022) as the base model
  • GPT-4o mini for experiments
  • Resource limits of $10 and 6 hours per experiment
  • 250 total experiment runs

Research Findings

From the experiments, six findings passed peer review:

  1. Prediction confidence: Language models show low correlation between self-assessed confidence and actual accuracy in state predictions.

  2. Representation effects: Models perform better with simpler state representations compared to complex text.

  3. Environment generation: Multi-stage generation of virtual environments produces better results than single-stage approaches.

  4. Optimization challenges: The study identified specific limitations in language models' ability to solve component substitution problems.

  5. Action prediction: Models struggle to predict action success with limited context.

  6. Memory structures: Graph-based memory helps improve agent performance in discovery tasks.

Current Limitations

The team identified several challenges:

  • Only 41% of experiments complete successfully
  • Code validation requires significant human review
  • Results often need verification with larger sample sizes
  • Some experiments show methodological issues

Human Oversight

The process currently needs human input for:

  • Selecting research papers
  • Verifying code examples
  • Filtering initial ideas
  • Providing expert feedback
  • Validating results

In tests without human guidance, the success rate dropped from 12% to 2%.

Conclusion

The research team plans to focus on improving code validation methods, increasing experiment reliability, reducing dependency on human oversight, and expanding to other scientific domains. "While we can't yet make groundbreaking discoveries automatically," Jansen notes, "we're learning how AI can help with scientific research, especially in processing and analyzing large amounts of information." The team has open-sourced CodeScientist to encourage further development in automated scientific discovery.

Valeriia Kuka

Valeriia Kuka, Head of Content at Learn Prompting, is passionate about making AI and ML accessible. Valeriia previously grew a 60K+ follower AI-focused social media account, earning reposts from Stanford NLP, Amazon Research, Hugging Face, and AI researchers. She has also worked with AI/ML newsletters and global communities with 100K+ members and authored clear and concise explainers and historical articles.


© 2025 Learn Prompting. All rights reserved.