Structured AI Development for Android: A Complete Workflow from Design to Play Store

Why Unstructured AI Coding Fails for Android Projects

You open your AI coding agent, type “add a settings screen with dark mode toggle,” and the agent immediately starts writing code. Three minutes later, it’s generated a SettingsActivity with XML layouts — except your app uses single-activity Compose navigation. It didn’t check your architecture. It didn’t ask about your theming approach. It didn’t write tests. And now you’re spending more time fixing its output than you would have spent writing the feature yourself.

This is the default behavior of every AI coding agent today: jump straight to implementation. For trivial scripts and one-off utilities, that works fine. For structured AI Android development — where you have architecture patterns, DI graphs, Compose design systems, and a test suite to maintain — it’s a recipe for technical debt at machine speed.

The Superpowers framework by Jesse Vincent introduced a better approach: force the AI agent through a disciplined pipeline of design, planning, test-driven implementation, and verification. In this guide, you’ll learn how to adapt that methodology specifically to Android projects — with the architecture patterns, testing tools, and Gradle workflows you actually use.

The Pipeline: Seven Phases from Idea to Play Store

Before diving into each phase, here’s the full picture. Every feature request flows through seven mandatory gates. No skipping, no shortcuts.

  1. Brainstorm: design & validate
  2. Plan: decompose tasks
  3. Isolate: git worktree
  4. Implement: TDD + subagents
  5. Debug: root cause first
  6. Review: spec + quality
  7. Verify: evidence-based

Each phase has an iron law — a non-negotiable rule that must be satisfied before the agent can proceed. If the agent tries to skip a gate, it gets blocked. This is what separates disciplined AI development from “vibe coding.”

Phase 1: Brainstorming — Design Before You Code

Iron law: No code is written until the design is presented and the human approves it.

When you ask an AI agent to build a feature, the first thing it should do is explore your project. For an Android app, that means understanding your architecture before proposing anything. The agent should check your existing module structure, your navigation graph, your DI setup, your design system, and your existing patterns.

Here’s what the brainstorming phase looks like adapted to Android:

Step 1: Context gathering. The agent reads your project’s key files — build.gradle.kts files, your AppModule (Hilt/Koin), your NavGraph, your theme definition, and your CLAUDE.md or architecture docs. For Android projects, these files tell the agent more than any description you could write.

Step 2: Clarifying questions. Instead of assuming, the agent asks one question at a time. For a “settings screen” request, it might ask: “Your app uses Compose with Material 3 dynamic theming — should the dark mode toggle persist via DataStore and recompose the theme, or does your hybrid Compose/AppCompat setup require AppCompatDelegate.setDefaultNightMode()?” That’s the kind of Android-specific question that prevents bad implementations.

Step 3: Propose approaches. The agent presents 2-3 options with trade-offs. For example: (A) DataStore + recomposition with a simple boolean preference, (B) full theme engine with system/light/dark modes using a sealed class, or (C) follow Material You and let the system handle it entirely. Each option spells out which Compose APIs it uses, what state management pattern it follows, and how it fits your existing architecture.
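
For illustration, option (B)'s theme modes might be modeled with a small sealed hierarchy like the sketch below. The type and object names are assumptions for the sake of the example, not part of any approved design:

// Hypothetical sealed model of system/light/dark theme modes (option B).
sealed interface ThemeMode {
    object System : ThemeMode
    object Light : ThemeMode
    object Dark : ThemeMode
}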

Step 4: Design doc. Once you pick an approach, the agent writes a short design document — not a novel, just enough to be unambiguous. For Android, this includes the data flow diagram, which layer owns the state (ViewModel vs DataStore), which Compose components are new vs modified, and which existing tests need updating.

Only after you explicitly approve the design does the agent move to Phase 2. If you’ve been working with Claude Code and a well-structured CLAUDE.md, the brainstorming phase becomes much faster — the agent already knows your patterns.

Phase 2: Planning — Break It Into Bite-Sized Tasks

Iron law: Every task must be completable in 2-5 minutes with exact file paths, complete code, and expected outputs.

This is where most developers underestimate the value of AI discipline. A vague plan like “implement the settings screen” leads to an agent that makes dozens of decisions on its own, many of them wrong. A precise plan eliminates ambiguity.

For Android projects, each task in the plan should specify:

  • The exact file to create or modify (e.g., feature/settings/src/main/java/…/SettingsViewModel.kt)
  • Complete code to write — not a description, the actual code
  • Which tests to write or update, with the test file path
  • The Gradle command to verify (e.g., ./gradlew :feature:settings:testDebugUnitTest)
  • Expected output (“BUILD SUCCESSFUL, 12 tests passed”)

Here’s how a typical Android feature gets decomposed:

  1. Data layer (Repository): SettingsRepository + DataStore; verification: unit tests pass
  2. Domain layer (UseCase): GetThemeUseCase, SetThemeUseCase; verification: unit tests pass
  3. DI wiring (Hilt module): SettingsModule with bindings; verification: Hilt compilation succeeds
  4. ViewModel (Presentation): SettingsViewModel + UiState; verification: ViewModel tests pass
  5. Compose UI (UI): SettingsScreen composable; verification: Compose preview + UI tests
  6. Navigation (App): Route + NavGraph entry; verification: navigation test passes
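
To make the plan entries concrete, task 3's deliverable could be as small as the sketch below, assuming Hilt is the DI framework and a DataStore<Preferences> binding already exists in the graph (the module and provider names are illustrative):

import androidx.datastore.core.DataStore
import androidx.datastore.preferences.core.Preferences
import dagger.Module
import dagger.Provides
import dagger.hilt.InstallIn
import dagger.hilt.components.SingletonComponent
import javax.inject.Singleton

@Module
@InstallIn(SingletonComponent::class)
object SettingsModule {

    // Wires the data-layer repository into the DI graph so later tasks
    // (use cases, ViewModel) can declare it as a plain constructor dependency.
    @Provides
    @Singleton
    fun provideSettingsRepository(
        dataStore: DataStore<Preferences>
    ): SettingsRepository = SettingsRepository(dataStore)
}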

Notice the pattern: data layer first, then domain, then presentation, then wiring. This bottom-up approach means each task builds on verified, tested code from the previous task. The agent never writes UI code against an untested ViewModel.

Phase 3: Isolation — Git Worktrees for Safe Development

Iron law: Never work directly on main. Every feature gets an isolated worktree.

Before the agent writes a single line of code, it creates a git worktree — an isolated copy of your repository on a separate branch. This is especially important for Android projects because Gradle’s build cache and incremental compilation can produce confusing results when switching branches in a single working directory.

The workflow is simple: the agent runs git worktree add to create a new branch and working directory, then runs your project’s setup commands (typically ./gradlew assembleDebug to verify the build is clean), and establishes a test baseline by running the full test suite. Only after all tests pass on the fresh worktree does the agent begin implementing tasks.

Why this matters for Android specifically: Hilt’s code generation, Room’s schema validation, and Compose’s compiler plugin all produce generated code that can conflict across branches. A worktree avoids the “works on my branch, breaks on main” problem entirely.

Phase 4: Implementation — TDD with Subagents

This is the core of the methodology, and it combines two powerful ideas: test-driven development and subagent delegation.

The TDD Cycle for Android

Iron law: No production code without a failing test first. No exceptions.

  • RED: write a failing test. What should this code DO?
  • GREEN: write the minimal code to pass. The simplest thing that works.
  • REFACTOR: clean up while tests stay green. Remove duplication, improve names.

For Android, each layer of the architecture has different testing tools and patterns. Here’s how TDD maps to an Android project:

Repository / Data layer: Write unit tests with JUnit 5 and kotlinx-coroutines-test. If the repository wraps a Room DAO, use an in-memory database for fast tests. The failing test defines the contract: “when I call getTheme(), it returns the stored preference.” (Use .first() to extract a single Flow value in tests; production UI code uses .collectAsState() in Compose.) Then write the implementation.

@Test
fun `getTheme returns dark when preference is set to dark`() = runTest {
    // Arrange (THEME_KEY = stringPreferencesKey("theme"))
    dataStore.edit { prefs ->
        prefs[THEME_KEY] = "dark"
    }

    // Act
    val result = repository.getTheme().first()

    // Assert
    assertEquals(Theme.DARK, result)
}
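
In the GREEN step, just enough production code is written to satisfy that contract. A minimal DataStore-backed implementation might look like the sketch below; the Theme enum, the key name, and the constructor shape are assumptions for illustration:

import androidx.datastore.core.DataStore
import androidx.datastore.preferences.core.Preferences
import androidx.datastore.preferences.core.edit
import androidx.datastore.preferences.core.stringPreferencesKey
import kotlinx.coroutines.flow.Flow
import kotlinx.coroutines.flow.map

enum class Theme { LIGHT, DARK }

// Assumed to be the same key the test writes to.
val THEME_KEY = stringPreferencesKey("theme")

class SettingsRepository(private val dataStore: DataStore<Preferences>) {

    // Maps the stored string preference to the Theme enum, defaulting to LIGHT.
    fun getTheme(): Flow<Theme> =
        dataStore.data.map { prefs ->
            if (prefs[THEME_KEY] == "dark") Theme.DARK else Theme.LIGHT
        }

    // Persists the selected theme back to DataStore.
    suspend fun setTheme(theme: Theme) {
        dataStore.edit { prefs ->
            prefs[THEME_KEY] = if (theme == Theme.DARK) "dark" else "light"
        }
    }
}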

ViewModel: Test with Turbine for Flow assertions and a TestDispatcher for coroutine control. The failing test defines the state transitions: “when toggleDarkMode() is called, the UI state emits isDarkMode = true.” If you’ve used StateFlow vs SharedFlow patterns before, you know exactly how to assert against the ViewModel’s exposed state.

@Test
fun `toggling dark mode updates ui state`() = runTest {
    val viewModel = SettingsViewModel(
        getThemeUseCase = FakeGetThemeUseCase(Theme.LIGHT),
        setThemeUseCase = FakeSetThemeUseCase()
    )

    viewModel.uiState.test {
        assertEquals(false, awaitItem().isDarkMode)
        viewModel.toggleDarkMode()
        assertEquals(true, awaitItem().isDarkMode)
        cancel()
    }
}
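
A ViewModel that satisfies this test could look roughly like the sketch below, reusing the Theme enum from the repository sketch. The use case interfaces and the UiState shape are assumptions; in a real project they would come from the domain-layer tasks:

import androidx.lifecycle.ViewModel
import androidx.lifecycle.viewModelScope
import kotlinx.coroutines.flow.Flow
import kotlinx.coroutines.flow.MutableStateFlow
import kotlinx.coroutines.flow.StateFlow
import kotlinx.coroutines.flow.asStateFlow
import kotlinx.coroutines.flow.update
import kotlinx.coroutines.launch

data class SettingsUiState(val isDarkMode: Boolean = false)

// Assumed use case shapes; the fakes in the test implement these.
interface GetThemeUseCase { operator fun invoke(): Flow<Theme> }
interface SetThemeUseCase { suspend operator fun invoke(theme: Theme) }

class SettingsViewModel(
    private val getThemeUseCase: GetThemeUseCase,
    private val setThemeUseCase: SetThemeUseCase
) : ViewModel() {

    private val _uiState = MutableStateFlow(SettingsUiState())
    val uiState: StateFlow<SettingsUiState> = _uiState.asStateFlow()

    init {
        // Keep the UI state in sync with the persisted theme.
        viewModelScope.launch {
            getThemeUseCase().collect { theme ->
                _uiState.update { it.copy(isDarkMode = theme == Theme.DARK) }
            }
        }
    }

    fun toggleDarkMode() {
        viewModelScope.launch {
            val newTheme = if (_uiState.value.isDarkMode) Theme.LIGHT else Theme.DARK
            setThemeUseCase(newTheme)
            _uiState.update { it.copy(isDarkMode = newTheme == Theme.DARK) }
        }
    }
}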

Compose UI: Use createComposeRule() (from androidx.compose.ui:ui-test-junit4) for UI tests. The failing test asserts that the toggle exists and responds to clicks. Note: Compose UI tests use JUnit 4, while your unit tests may use JUnit 5. If you’re following our Compose side effects guide, make sure your tests account for LaunchedEffect triggers.

@get:Rule
val composeRule = createComposeRule()

@Test
fun darkModeToggle_displaysCorrectState() {
    composeRule.setContent {
        SettingsScreen(
            uiState = SettingsUiState(isDarkMode = false),
            onToggleDarkMode = {}
        )
    }

    composeRule
        .onNodeWithText("Dark Mode")
        .assertIsDisplayed()
    composeRule
        .onNodeWithTag("dark_mode_switch")
        .assertIsOff()
}
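
A composable that passes this test can be as small as the sketch below, assuming a Material 3 Switch and the SettingsUiState from the ViewModel sketch (everything not named in the test is illustrative):

import androidx.compose.foundation.layout.Row
import androidx.compose.foundation.layout.fillMaxWidth
import androidx.compose.foundation.layout.padding
import androidx.compose.material3.Switch
import androidx.compose.material3.Text
import androidx.compose.runtime.Composable
import androidx.compose.ui.Alignment
import androidx.compose.ui.Modifier
import androidx.compose.ui.platform.testTag
import androidx.compose.ui.unit.dp

@Composable
fun SettingsScreen(
    uiState: SettingsUiState,
    onToggleDarkMode: (Boolean) -> Unit
) {
    Row(
        verticalAlignment = Alignment.CenterVertically,
        modifier = Modifier
            .fillMaxWidth()
            .padding(16.dp)
    ) {
        // Label asserted by onNodeWithText("Dark Mode").
        Text(text = "Dark Mode", modifier = Modifier.weight(1f))

        // Tag matches onNodeWithTag("dark_mode_switch"); Switch exposes
        // toggleable semantics, so assertIsOff()/assertIsOn() work on it.
        Switch(
            checked = uiState.isDarkMode,
            onCheckedChange = onToggleDarkMode,
            modifier = Modifier.testTag("dark_mode_switch")
        )
    }
}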

Subagent Delegation: One Task, One Agent

Here’s the key insight from the Superpowers methodology: instead of one AI agent building the entire feature (accumulating context and making increasingly confused decisions), you dispatch a fresh agent for each task. Each subagent gets precisely scoped instructions and delivers a focused result.

  • Orchestrator agent: holds the plan, dispatches tasks, reviews results.
  • Subagent 1: Repository + tests. Implements → self-reviews → commits.
  • Subagent 2: ViewModel + tests. Implements → self-reviews → commits.
  • Subagent 3: Compose UI + tests. Implements → self-reviews → commits.
  • Spec reviewer: does the code match the plan?
  • Quality reviewer: is the code well-written?
For Android projects, this pattern works exceptionally well because of the layered architecture. Task 1 (Repository) has zero dependencies on Task 4 (ViewModel), so a subagent can implement the repository without knowing anything about the UI. Each subagent follows the TDD cycle: write failing test, implement, refactor, commit.

After each subagent delivers, two review passes happen: a spec compliance review (does the code match the plan?) and a code quality review (is it well-written, does it follow the project’s patterns?). Only after both reviews pass does the orchestrator move to the next task.

Phase 5: Systematic Debugging for Android

Iron law: No fixes without root cause investigation first. If 3+ attempts fail, stop and question the architecture.

When tests fail or crashes appear during implementation, the instinct is to start guessing. The agent tries changing a parameter, adding a null check, wrapping something in a try-catch. Three attempts later, the code is worse than before.

Systematic debugging follows a strict four-phase protocol:

Phase 1: Investigate. Read the full stack trace. Check Logcat output. Identify the exact line and exception. Reproduce consistently. Check git diff for recent changes. For Android-specific issues: check ProGuard/R8 rules in proguard-rules.pro, verify the Hilt component hierarchy, inspect Room schema migrations, review Compose recomposition with Layout Inspector.

Phase 2: Pattern Analysis. Find similar working code in the project. Compare the broken code with the working version line by line. List every difference. For Android: compare with a working Hilt module if DI fails, compare with a working Screen composable if UI crashes, compare with a working DAO if Room queries fail.

Phase 3: Hypothesis. Form a specific, written hypothesis: “The crash occurs because the ViewModel is injected before the SavedStateHandle is available.” Test with the smallest possible change. One variable at a time. If it doesn’t work, form a new hypothesis — don’t pile on fixes.

Phase 4: Targeted Fix. Write a failing test that reproduces the bug. Implement the single fix that addresses the root cause. Verify no other tests broke. For Android: run ./gradlew connectedDebugAndroidTest after any fix that touches UI or database code, not just unit tests.

If you’ve read our guide on rubber duck debugging with AI, that post covers the interactive, conversational side of debugging. This phase is different — it’s a formalized protocol that an autonomous agent follows without human hand-holding. The two approaches complement each other: use rubber duck debugging when you’re stuck personally, and systematic debugging when the agent is working autonomously.

Phase 6: Code Review — Two-Stage Gate

Iron law: Every task gets reviewed for spec compliance AND code quality before being marked complete.

The code review phase uses a separate reviewer agent — a fresh context that hasn’t seen the implementation process, only the result. For Android projects, the reviewer checks:

Spec compliance: Does the code match the plan exactly? If the plan said “use DataStore for theme persistence,” did the implementation actually use DataStore, or did it drift to SharedPreferences? Are all the planned tests written?

Architecture compliance: Does the code follow the project’s established patterns? For Android, this means: is the ViewModel using unidirectional data flow? Is business logic in the UseCase, not the ViewModel? Are Compose functions stateless where they should be? Is the Hilt module structured like existing modules?

Android-specific checks: Are lifecycle concerns handled properly? Does the ViewModel survive configuration changes? Are coroutines scoped correctly (viewModelScope, not GlobalScope)? Are Compose side effects using the right effect handler? Is the new code compatible with the existing navigation graph?

The reviewer outputs structured feedback categorized as Critical (must fix), Important (should fix before merging), or Suggestion (nice to have). The implementing agent addresses critical and important items, then the reviewer verifies the fixes.

Phase 7: Verification — Evidence or It Didn’t Happen

Iron law: No completion claims without fresh verification evidence.

This is the simplest phase and the most commonly skipped. Before the agent can say “done,” it must:

  1. Run the full test suite: ./gradlew testDebugUnitTest
  2. Run connected tests if UI was changed: ./gradlew connectedDebugAndroidTest
  3. Verify the build succeeds: ./gradlew assembleDebug
  4. Read the actual output and confirm zero failures
  5. Only then report completion with the evidence

The key word is fresh. Not “the tests passed earlier.” Not “they should still pass.” Run them again, right now, and show the output. For Android, this also means verifying that Hilt’s component generation succeeded (a common silent failure) and that Room’s schema hash matches.

Prompt Templates You Can Use Today

You don’t need to install any plugin to start using this methodology. Here are prompt templates you can paste directly into Claude Code, Cursor, or any AI coding agent. Adapt the project-specific details to your codebase.

Brainstorming prompt:

Before writing any code, explore this Android project's architecture:
1. Read build.gradle.kts files, the navigation graph, and DI modules
2. Ask me ONE clarifying question at a time about the feature
3. Propose 2-3 approaches with trade-offs specific to this codebase
4. Write a short design doc covering: data flow, state ownership,
   new/modified components, and affected tests
5. Wait for my approval before proceeding to implementation

Feature request: [describe your feature]

TDD implementation prompt:

Implement this task using strict TDD (RED-GREEN-REFACTOR):
1. Write a failing test FIRST that defines the expected behavior
2. Run the test and confirm it FAILS for the expected reason
3. Write the MINIMAL code to make the test pass
4. Run ALL tests and confirm they pass
5. Refactor if needed while keeping tests green
6. Commit after each green cycle

Task: [paste the specific task from your plan]
Test file: [path/to/test/file]
Implementation file: [path/to/source/file]
Verify with: ./gradlew :module:testDebugUnitTest

Debugging prompt:

Debug this issue using systematic root cause analysis:
1. INVESTIGATE: Read the full error/stack trace. Check git diff.
   Reproduce the issue consistently.
2. ANALYZE: Find similar WORKING code in this project. List every
   difference between working and broken code.
3. HYPOTHESIZE: Write a specific hypothesis. Test with ONE change.
4. FIX: Write a failing test that reproduces the bug, then fix it.

STOP after 3 failed attempts and reassess the architecture.

Error: [paste the error or describe the bug]

These templates work with any AI coding agent. If you’re already using Claude Code skills for Android, you can save these as reusable skills that trigger automatically based on context.

Real-World Example: Adding Offline Sync to an Existing App

Let’s walk through how this methodology plays out with a real feature. You have an Android app that loads data from a REST API, and you want to add offline support using Room as a local cache.

Brainstorming produces: A design where the Repository becomes the single source of truth, Room stores cached responses, and a SyncManager using WorkManager handles background refresh. The ViewModel doesn’t change its API — it still calls repository.getItems(), but now that method checks Room first and fetches from the network only when needed. Design approved.

Planning produces 6 tasks: (1) Room entity and DAO, (2) migration from current schema, (3) Repository refactor to cache-first pattern, (4) SyncManager with WorkManager, (5) Hilt module updates, (6) error state handling in ViewModel. Each task has exact file paths, test expectations, and Gradle verification commands.

Implementation via subagents: Task 1’s subagent writes a Room entity test asserting insert/query behavior, implements the entity and DAO, and commits. Task 2’s subagent writes a migration test using MigrationTestHelper, implements the migration, verifies the schema hash. And so on. Each subagent works in isolation, follows TDD, and gets reviewed.
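
As an illustration of what Task 1’s subagent delivers, the entity and DAO might be sketched as below. The table, column, and type names are assumptions about a hypothetical API payload, not the app’s actual schema:

import androidx.room.Dao
import androidx.room.Entity
import androidx.room.Insert
import androidx.room.OnConflictStrategy
import androidx.room.PrimaryKey
import androidx.room.Query
import kotlinx.coroutines.flow.Flow

@Entity(tableName = "cached_items")
data class CachedItemEntity(
    @PrimaryKey val id: String,
    val title: String,
    val updatedAtMillis: Long
)

@Dao
interface CachedItemDao {

    // The cache-first repository observes this query as its single source of truth.
    @Query("SELECT * FROM cached_items ORDER BY updatedAtMillis DESC")
    fun observeAll(): Flow<List<CachedItemEntity>>

    // Fresh network responses are written back through this upsert.
    @Insert(onConflict = OnConflictStrategy.REPLACE)
    suspend fun upsertAll(items: List<CachedItemEntity>)
}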

Verification: After all tasks complete, the orchestrator runs the full suite — unit tests, Room migration tests, connected tests — and confirms zero failures. The feature branch is ready for PR.

If you’ve followed our post on offline-first Android with Room and Ktor, you already know the technical patterns. This methodology adds the process discipline that ensures those patterns are implemented correctly by an AI agent.

When to Use This and When It’s Overkill

This full seven-phase pipeline is designed for features that touch multiple layers of your architecture — a new screen, a data sync system, a payment flow, a migration. These are the tasks where AI agents cause the most damage when left undisciplined.

For simpler tasks — renaming a variable, fixing a typo, adding a single field to an existing data class — skip the ceremony. The methodology is a tool, not a religion. Use the brainstorming phase when you’re unsure about the approach. Use TDD when the code has meaningful behavior to verify. Use subagents when the feature has 3+ independent tasks. Use the full pipeline when you’re shipping something that matters.

The goal isn’t to slow down your AI agent. It’s to make its output trustworthy enough that you can review a clean PR instead of rewriting half the code it generated. Done right, you’ll ship better features faster — not because the AI writes more code, but because it writes the right code the first time.


This post was written by a human with the help of Claude, an AI assistant by Anthropic.
