Building VibeCheck: Reimagining technical interviews at CalHacks 12.0

Evan Yu

My experience building VibeCheck at CalHacks 12.0

Two weeks ago, I travelled to San Francisco with three friends to participate in CalHacks 12.0, a 48-hour hackathon hosted by UC Berkeley at the Palace of Fine Arts. Our team consisted of four members: me (Evan Yu), Ethan Ng, Luke Griggs, and Aaronkhan Hubhachen.

At the start of the hackathon, we didn't have a clear idea of what to build. The one thing we all had in common, though, was our experience with technical interviews. We'd all spent hours balancing binary search trees, optimizing dynamic programming problems, and memorizing Dijkstra's algorithm. These skills, ironically, rarely come up in real-world software development.

With the advent of AI, developers are expected to be able to effectively utilize AI in their daily work. Every day, we use AI tools like ChatGPT, Cursor, and Claude Code to help us ship code faster and more efficiently.

So, we decided to build VibeCheck: an interview platform that reimagines the traditional technical interview process and evaluates developers holistically. Our goal was to assess developers not just on their coding ability, but also on how effectively they use AI, leveraging it as a tool rather than becoming blindly dependent on it (vibecoding).

Essentially, we wanted to create an interview platform that tests engineers on what they actually do on the job, with the help of AI.

The Process

Day 1: The Beginning

I landed in SF on Friday morning with a sore throat (which eventually evolved into a full-blown cold). At the venue, I met up with Ethan and Luke, and we spent the first couple of hours running around, grabbing all the free swag we could get our hands on.

Hour 3 - Brainstorming

Once the hackathon began and our team finally gathered, we dove straight into brainstorming. We had come in completely unprepared, and as we talked, Aaron arrived fresh from an interview, which got us talking about how challenging and frustrating technical interviews are. I mentioned that I had some experience building in-browser IDEs (like Runway and PyEval), and from there the idea for VibeCheck started to take shape: an interview platform that lets candidates code in an environment that feels just like the one they'd use on the job.

Hour 4 - The Roadblock

As we started scoping out the project, we quickly hit a major roadblock: the Wi-Fi was awful, and we could barely get anything done. To make matters worse, the air quality was terrible, which further irritated my sore throat. That's when we decided to pack up and head to Luke's brother's apartment on the south side of SF.

Once we settled in, we split up the work: I focused on building the in-browser IDE, drawing on my experience with Monaco and WebContainers. Ethan took charge of the AI implementation and connecting it to the filesystem/editor, while Luke and Aaron worked on building and validating the holistic AI grading system.

Day 2: Integration

We woke up early on Saturday morning, well-rested and ready to keep building. I spent the day continuing development on the in-browser IDE, while Ethan shipped an MVP of the AI integration featuring patch-based edits. That night, we had to vacate the apartment since Luke's brother had friends over, so we relocated to Vivarium, a warehouse-turned-hacker-house Jared was renting out (check them out; they're pretty cool, and a little disturbing). I stayed up until 7 a.m. implementing proper tool calling on the client side and completing the integration between the onboarding flow and the submission flow.

Vivarium

Day 3: Chaos...

At 8 a.m. on Sunday, we finally finished the project. Instead of recording our demo right then and there, we decided to Uber back to the venue, assuming the Wi-Fi would be better. Big mistake. The Wi-Fi was somehow even worse than before; we couldn't connect to it at all. Desperate, we ended up recording our demo crouched on the sidewalk outside the Palace of Fine Arts, relying on painfully slow cellular data. In a stroke of genius, we realized that YouTube Studio allows post-upload editing, so we uploaded our demo, submitted the link at the last minute, and painstakingly trimmed the video to fit the time limit, all on sluggish cellular data, with the airwaves completely clogged by hundreds of other hackers doing the same. To make matters worse, we were hit with one issue after another: the demo would randomly hit edge cases and break, our audio cut out at the worst times, and our cell connection kept dropping. Somehow, through all of that, we still managed to get the final demo submitted just in time, after a clutch 5-minute deadline extension.

So, what did we build?

VibeCheck is a full-stack interview platform that evaluates candidates in a realistic development environment with AI assistance. The platform consists of three main components: the onboarding flow, the in-browser IDE, and the holistic grading system.

The onboarding flow

Candidates begin by selecting their preferred framework (React Router v7 or Next.js) and experience level. Based on these selections, our AI generates a personalized coding challenge that matches their skill level.

The assignments are real-world scenarios, like building an e-commerce product list with filtering and sorting, creating a blog dashboard with CRUD operations, or implementing a task management system. These aren't your typical LeetCode problems—they're tasks that actually test whether you can build functional, user-facing applications with the help of AI, just like you would on the job.

The system dynamically generates:

  • A detailed project specification with requirements and acceptance criteria
  • A custom rubric for grading based on the specific challenge
  • Framework-specific boilerplate code to get candidates started quickly

Once the challenge is generated, candidates are dropped into the IDE where they can code with full AI assistance for a set time period (typically 60-90 minutes).
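
Under the hood, this kind of generation maps cleanly onto a single structured-output call. Here's a minimal sketch using the Vercel AI SDK's generateObject; the schema fields, prompt wording, type signatures, and model id are illustrative assumptions rather than our exact code.

import { generateObject } from "ai";
import { anthropic } from "@ai-sdk/anthropic";
import { z } from "zod";

// Illustrative schema: the real spec and rubric shapes were more detailed.
const assignmentSchema = z.object({
  title: z.string(),
  specification: z.string().describe("Requirements and acceptance criteria"),
  rubric: z.array(
    z.object({
      category: z.string(),
      weight: z.number(),
      criteria: z.array(z.string()),
    }),
  ),
  boilerplateFiles: z.array(
    z.object({ path: z.string(), contents: z.string() }),
  ),
});

export async function generateAssignment(
  framework: "react-router-v7" | "nextjs",
  experienceLevel: string,
) {
  const { object } = await generateObject({
    model: anthropic("claude-sonnet-4-5"), // model id is an assumption
    schema: assignmentSchema,
    prompt: `Generate a realistic ${framework} coding challenge for a ${experienceLevel} developer: a project spec with acceptance criteria, a weighted grading rubric, and starter files.`,
  });
  return object;
}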

Assignment generation

The IDE

We built a fully in-browser Cursor-like IDE that feels like a real development environment. The architecture consists of several key pieces:

  • Monaco Editor - The same editor that powers VS Code, providing syntax highlighting, IntelliSense, and familiar keyboard shortcuts
  • WebContainer - A full Node.js runtime that runs entirely in the browser, allowing us to execute npm commands, run dev servers, and preview applications without any backend infrastructure
  • zen-fs - An in-memory filesystem that provides a Unix-like file system interface, enabling file operations, directory traversal, and proper file watching
  • React + Tailwind - For building a responsive, modern UI that ties everything together

The IDE features a split-pane layout with the code editor on the left, a live preview on the right, and an AI chat interface accessible via a sidebar. Candidates can edit files, see changes reflected in real-time in the preview, and interact with the AI assistant to help them build their solution.

One of the biggest technical challenges was getting WebContainer, Monaco, and zen-fs to all work together seamlessly. We had to carefully manage file system synchronization to ensure changes made by the AI or the editor were immediately reflected across all systems.
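
To give a sense of what that glue looks like, here's a simplified sketch of the editor-to-filesystem direction: edits are written to both zen-fs (which backs the file tree, assuming the @zenfs/core package) and the WebContainer filesystem (which the dev server watches). The function and the debounce interval are illustrative; the real implementation also synchronized changes flowing the other way.

import * as monaco from "monaco-editor";
import { WebContainer } from "@webcontainer/api";
import { fs } from "@zenfs/core";

// Mirror editor changes into both filesystems so the file tree and the
// dev server preview stay consistent with what the candidate sees.
export function wireEditorSync(
  editor: monaco.editor.IStandaloneCodeEditor,
  container: WebContainer,
  filePath: string,
) {
  let timeout: ReturnType<typeof setTimeout> | undefined;

  editor.onDidChangeModelContent(() => {
    // Debounce so rapid keystrokes don't hammer both filesystems.
    clearTimeout(timeout);
    timeout = setTimeout(async () => {
      const contents = editor.getValue();
      await fs.promises.writeFile(filePath, contents, "utf8");
      await container.fs.writeFile(filePath, contents);
    }, 250);
  });
}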

AI Editor

The AI

We used the Vercel AI SDK with Claude Sonnet 4.5 as the underlying model powering all AI features, including chat, assignment generation, and code editing.
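
On the server, the chat endpoint boils down to one streaming call with tools the model can invoke. A rough sketch, assuming AI SDK v4-style APIs and a standard Request-handling route; the tool name and parameters are hypothetical stand-ins for our actual editing tools:

import { streamText, tool } from "ai";
import { anthropic } from "@ai-sdk/anthropic";
import { z } from "zod";

export async function POST(req: Request) {
  const { messages } = await req.json();

  const result = streamText({
    model: anthropic("claude-sonnet-4-5"), // model id is an assumption
    system:
      "You are a pair-programming assistant inside a timed interview IDE. " +
      "Prefer small, reviewable patches over rewriting whole files.",
    messages,
    tools: {
      // No execute() here: the client receives the tool call and applies the
      // proposed edit to the in-browser filesystem itself.
      proposeFileEdit: tool({
        description: "Propose a unified diff to apply to a file in the project",
        parameters: z.object({
          path: z.string(),
          diff: z.string(),
        }),
      }),
    },
  });

  return result.toDataStreamResponse();
}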

The AI assistant can help candidates in several ways:

  • Answering questions about the requirements
  • Suggesting architectural approaches
  • Generating code snippets or entire files
  • Debugging errors and issues
  • Refactoring existing code

The LLM generates diffs in a structured format that we parse and display in the UI before applying them to the files. Because LLMs often produce malformed diffs, I implemented an automatic failover system: if the diff fails to apply cleanly, we instruct Claude Haiku to manually reconstruct the file with the intended changes.
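
Sketched out, the failover path looks something like this; applyPatch from the jsdiff package stands in for whatever diff-application logic you use, the Haiku prompt is paraphrased, and the model id is an assumption:

import { generateText } from "ai";
import { anthropic } from "@ai-sdk/anthropic";
import { applyPatch } from "diff"; // jsdiff

export async function applyAiEdit(
  originalFile: string,
  unifiedDiff: string,
): Promise<string> {
  // Happy path: the model produced a well-formed unified diff.
  const patched = applyPatch(originalFile, unifiedDiff);
  if (patched !== false) return patched;

  // Failover: have a cheaper model rewrite the whole file with the intended changes.
  const { text } = await generateText({
    model: anthropic("claude-3-5-haiku-latest"),
    prompt: [
      "The following diff failed to apply cleanly.",
      "Rewrite the full file with the intended changes applied, and return only the file contents.",
      `--- ORIGINAL FILE ---\n${originalFile}`,
      `--- DIFF ---\n${unifiedDiff}`,
    ].join("\n\n"),
  });
  return text;
}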

The grader

The holistic grading system uses Claude Sonnet 4.5 to evaluate submissions across multiple dimensions. Unlike traditional automated tests that only check if the code works, our grader considers code quality, architecture decisions, error handling, user experience, and most importantly, how effectively the candidate utilized AI assistance.

The grading system analyzes the entire codebase with the following prompt:

function getGradingPrompt(
  challenge: string,
  framework: string,
  rubric: Record<string, unknown>,
  code: Record<string, string>,
): string {
  const codeString = Object.entries(code)
    .map(([path, content]) => `// ${path}\n${content}`)
    .join("\n\n---\n\n");
 
  return `
You are an expert technical interviewer grading a coding assessment. The candidate used AI assistance during this assessment.
 
## CHALLENGE
${challenge}
 
## FRAMEWORK
${framework === "react-router-v7" ? "React Router v7" : "Next.js"}
 
## GRADING RUBRIC
${JSON.stringify(rubric, null, 2)}
 
## SUBMITTED CODE
${codeString}
 
## INSTRUCTIONS
1. Grade each category in the rubric based on the submitted code
2. Assign scores (0-100) for each category
3. Provide specific, actionable feedback that references actual code
4. Mark each criterion as "passed", "partial", or "failed"
5. Calculate the overall score using the weighted average from rubric
6. Assess AI utilization effectiveness (how well they used AI assistance)
7. Assess adherence to the original prompt/requirements
8. Be fair but rigorous - this was AI-assisted, so expectations are higher for code quality
 
## IMPORTANT
- Look for framework-specific best practices (${
    framework === "react-router-v7"
    ? "loaders, actions, routing"
    : "Server Components, Server Actions, App Router"
})
- Consider code organization, error handling, and user experience
- Check if requirements were met
- Evaluate the quality of AI usage (clear variable names, good structure suggests good prompting)
 
## OUTPUT FORMAT
Return ONLY valid JSON in this exact structure:
{
"overallScore": 82,
"overallGrade": "Good",
"categories": [
    {
    "name": "Category Name from Rubric",
    "score": 85,
    "grade": "Excellent|Great|Good|Needs Improvement|Poor",
    "feedback": "Specific feedback referencing actual code...",
    "criteria": [
        {
        "name": "Criterion Name",
        "status": "passed|partial|failed"
        }
    ]
    }
],
"aiUtilization": {
    "effectiveness": 88,
    "feedback": "Analysis of how effectively AI was used..."
},
"adherenceToPrompt": 78,
"strengths": ["Strength 1", "Strength 2"],
"areasForImprovement": ["Area 1", "Area 2"]
}`;
}
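
Wiring this up is then a single model call plus JSON parsing. A minimal sketch (schema validation and error handling omitted; the fence-stripping is there because models occasionally wrap their output in a code block despite being told not to, and the model id is an assumption):

import { generateText } from "ai";
import { anthropic } from "@ai-sdk/anthropic";

export async function gradeSubmission(
  challenge: string,
  framework: string,
  rubric: Record<string, unknown>,
  code: Record<string, string>,
) {
  const { text } = await generateText({
    model: anthropic("claude-sonnet-4-5"),
    prompt: getGradingPrompt(challenge, framework, rubric, code),
  });

  // Strip a stray Markdown fence if the model added one, then parse.
  const json = text.replace(/^```(?:json)?\s*/, "").replace(/\s*```$/, "");
  return JSON.parse(json);
}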

The grader returns detailed feedback on each category in the rubric, identifying strengths and areas for improvement. It also evaluates "AI utilization effectiveness"—a metric that measures whether the candidate used AI as a productivity tool or as a crutch. For example, clear variable names, good code organization, and thoughtful error handling indicate effective prompting and understanding, while messy, inconsistent code suggests blind copy-pasting.

The results

After presenting VibeCheck during the judging session, we received overwhelmingly positive feedback from fellow hackers, judges, and sponsors alike (especially Anthropic!). Many people resonated with our mission to modernize technical interviews to reflect how developers actually work today.

While we didn't place in the official rankings, we walked away with something more valuable: validation that we were solving a real problem.

Reflections

CalHacks 12.0 was, without a doubt, the hardest hackathon I've attended to date. I was sick throughout the entire event, and I pulled an all-nighter Saturday into Sunday, coding until 7 a.m. with a sore throat and a low fever, fueled purely by adrenaline, determination, and an unhealthy amount of caffeine that probably took 10 years off my lifespan.

Yet despite all of this, I shipped like I've never shipped before. Building VibeCheck was one of the most intense and rewarding experiences of my life. I'm especially proud of what we managed to ship in such a short timeframe: a fully functional interview platform that re-implements Cursor's AI-assisted coding experience in the browser, in just 48 hours. When you're sick, sleep-deprived, and racing against the clock, you learn what you're truly capable of.

A few key takeaways:

AI is a force multiplier. We extensively used Claude, Cursor, and ChatGPT throughout development. The AI helped us ship faster, debug issues, and implement features we weren't familiar with. Ironically, we proved our own thesis: that AI doesn't replace developers, it makes good developers even more productive.

"Just make it exist first, you can make it good later." We cut a lot of corners and shipped features that were "good enough" rather than perfect. The UI could have been prettier, the AI could have been smarter, the grader could have been more sophisticated. But we prioritized getting something functional into people's hands.

Most importantly, this project reinforced my belief that the future of technical interviews shouldn't be about memorizing algorithms or whiteboarding tree traversals. It should be about evaluating how effectively you can leverage the tools available to you to build something real.

AI is here to stay, and if we don't adapt, we'll be left behind.


Thank you for reading! If you're interested, star the code on GitHub, and check out the Devpost!

Also read Ethan's blog post for his perspective!

Images

CalHacks Event