🧠 Project: AI-Powered Simplification of the U.S. Tax Code

Repository: github.com/candrasick/ai_tax_agent

Overview

In this capstone project, I built a suite of AI agents using Google's Gemini 1.5 Pro to tackle one of the most complex regulatory systems in existence: the U.S. tax code. The goal was ambitious: to simplify the legal text, maintain revenue neutrality, and surface low-impact, high-complexity candidates for deletion or rewrite.

The first full pass through 2,100+ sections resulted in a 60% reduction in length, compressing over 7,000 pages to fewer than 3,000. And this is just the beginning.

🎬 Video Summary

You can find a video summarizing the project here:

youtu.be/QEbOICTW3C8


🔍 Key Features and Capabilities

🧾 Exemption Extraction

Gemini was used to parse each section of the U.S. Code and extract tax exemptions, which are often deeply embedded or cross-referenced, as discrete entities. This enabled clearer mapping of tax advantages and loopholes to statutory sections.
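As a rough illustration of what "discrete entities" could look like in code, the sketch below parses a model response into typed records. The `Exemption` fields, the JSON shape, and the `parse_exemptions` helper are assumptions made for this example, not the repository's actual schema.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Exemption:
    """One exemption tied to the section it was found in (hypothetical schema)."""
    section_id: str
    description: str
    cross_references: tuple[str, ...] = ()


def parse_exemptions(section_id: str, llm_response: str) -> list[Exemption]:
    """Turn a JSON array returned by the model into Exemption records.

    Assumes the prompt asked for a JSON array of objects with a
    "description" key and an optional "cross_references" list.
    """
    exemptions = []
    for item in json.loads(llm_response):
        description = item.get("description", "").strip()
        if not description:
            continue  # skip empty entries rather than guessing at their meaning
        refs = tuple(item.get("cross_references", []))
        exemptions.append(Exemption(section_id, description, refs))
    return exemptions


# Hand-written placeholder standing in for real model output.
sample = '[{"description": "Example exemption", "cross_references": ["sec-B"]}, {"description": ""}]'
print(parse_exemptions("sec-A", sample))
```

Keeping the originating section ID on every record is what makes the later mapping from exemptions back to statutory sections straightforward.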

🧪 Integration Testing of Heuristics

Using Gemini as a cognitive test harness, I verified custom logic against form metadata and code references. This allowed for automated regression checks during agent design and helped tune the heuristics used for section-to-stat linkage.

📊 Parsing Unstructured PDF Statistics

Gemini's multimodal capabilities were used to convert IRS statistics, often locked in PDF tables, into structured JSON. This enabled dynamic linkage between form line items and statutory authority, empowering more accurate impact estimation.
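Once the statistics are available as JSON, they need to be indexed in a way that supports the join to form line items. The sketch below shows one possible approach; the field names (`form`, `line`, `amount`) and the sample rows are hypothetical and not taken from the project's data.

```python
import json


def load_line_item_stats(raw_json: str) -> dict[tuple[str, str], int]:
    """Index extracted statistics by (form, line) so they can be joined to sections."""
    stats = {}
    for row in json.loads(raw_json):
        # Values lifted from PDF tables often keep thousands separators; strip them.
        amount = int(str(row["amount"]).replace(",", ""))
        stats[(row["form"], str(row["line"]))] = amount
    return stats


# Made-up rows standing in for model output from a PDF table.
raw = '[{"form": "F1", "line": "1a", "amount": "1,250"}, {"form": "F1", "line": 2, "amount": 300}]'
stats = load_line_item_stats(raw)
print(stats)
```

Normalizing the line identifier to a string avoids mismatches when the model returns `2` for one row and `"2"` for another.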

📉 Complexity Scoring

Each section was scored based on:

  • Legal language complexity (measured directly)
  • Z-scores for section length
  • Amendment count
  • IRS bulletin references

Gemini synthesized these into a final complexity_score, which guided subsequent rewrite decisions.
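The z-score input can be computed with the standard library alone. The sketch below standardizes section lengths and assembles a feature record of the kind that could be handed to the model. The section IDs, counts, and the `features` layout are invented for illustration, and the synthesis step performed by Gemini is not reproduced here.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Standardize values: how many standard deviations each lies from the mean."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation over all sections
    if sigma == 0:
        return [0.0] * len(values)  # identical lengths, so there is no spread to measure
    return [(v - mu) / sigma for v in values]


# Hypothetical word counts for three sections.
section_lengths = {"sec-A": 100, "sec-B": 200, "sec-C": 300}
length_z = dict(zip(section_lengths, z_scores(list(section_lengths.values()))))

# Feature record for one section; the counts are made-up values.
features = {
    "section_id": "sec-C",
    "length_z": round(length_z["sec-C"], 3),
    "amendment_count": 4,
    "bulletin_references": 2,
}
print(features)
```

Using a z-score rather than raw length puts unusually long sections on a common scale with the other inputs, regardless of how long a typical section is.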

💰 Impact Cataloging

The agent estimated financial and entity impact using:

  • Tax statistics from the IRS
  • Line item instructions
  • Linked form references

This allowed the agent to identify high-impact areas of the code and prioritize retention or careful editing.
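A simple way to picture the financial side of this estimate is to roll line-item amounts up to the sections they are linked to. The sketch below assumes a hypothetical mapping from (form, line) pairs to section IDs, with made-up amounts. A line linked to several sections is counted in full for each of them, so the totals are upper bounds rather than an allocation.

```python
from collections import defaultdict


def impact_by_section(line_amounts, line_to_sections):
    """Sum line-item amounts for every section a line is linked to."""
    totals = defaultdict(int)
    for line_key, amount in line_amounts.items():
        # Lines with no known statutory link contribute nothing.
        for section_id in line_to_sections.get(line_key, []):
            totals[section_id] += amount
    return dict(totals)


# Made-up amounts and links for illustration only.
line_amounts = {("F1", "1a"): 1250, ("F1", "2"): 300, ("F2", "7"): 40}
line_to_sections = {("F1", "1a"): ["sec-A"], ("F1", "2"): ["sec-A", "sec-B"]}

totals = impact_by_section(line_amounts, line_to_sections)
print(totals)  # {'sec-A': 1550, 'sec-B': 300}
```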

🧭 LLM-Grounded Decision Making

The agent made edit decisions (simplify, redraft, delete, keep) by reasoning across both complexity and impact. Sections with low complexity and low impact were often deleted. High-impact sections were simplified or redrafted conservatively.
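The agent's decisions came from LLM reasoning, not from a fixed rule, but a deterministic approximation helps show the shape of the trade-off. In the sketch below, the cutoffs, the 0-to-1 score scale, and the exact mapping of each quadrant to an action are all assumptions. It is one plausible reading of the policy described above, not the agent's actual logic.

```python
from enum import Enum


class Action(Enum):
    SIMPLIFY = "simplify"
    REDRAFT = "redraft"
    DELETE = "delete"
    KEEP = "keep"


def choose_action(complexity, impact, complexity_cutoff=0.5, impact_cutoff=0.5):
    """Pick an edit action from two scores in [0, 1] (hypothetical cutoffs)."""
    high_complexity = complexity >= complexity_cutoff
    high_impact = impact >= impact_cutoff
    if high_impact:
        # High-impact text is edited conservatively or left alone.
        return Action.SIMPLIFY if high_complexity else Action.KEEP
    # Low-impact text is where deletion or a full rewrite is least risky.
    return Action.REDRAFT if high_complexity else Action.DELETE


for scores in [(0.2, 0.1), (0.9, 0.1), (0.9, 0.8), (0.2, 0.8)]:
    print(scores, choose_action(*scores).value)
```

A rule like this can also serve as a sanity check: cases where the LLM's choice disagrees with the simple quadrant rule are natural candidates for manual review.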

ChromaDB was used to semantically retrieve related sections, exemptions, and form metadata. This gave the LLM critical context and allowed deeper insight into how a section connects across the tax ecosystem.
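To show what semantic retrieval does under the hood, here is a dependency-free stand-in that ranks stored items by cosine similarity to a query vector. It does not use ChromaDB. The two-dimensional vectors and the item IDs are toy values, whereas a real vector store would hold embeddings produced by an embedding model.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, index, k=2):
    """Return the IDs of the k stored items most similar to the query."""
    ranked = sorted(index.items(), key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [item_id for item_id, _ in ranked[:k]]


# Toy "embeddings" for a section, an exemption, and a form record.
index = {
    "section-A": [1.0, 0.0],
    "exemption-B": [0.9, 0.1],
    "form-C": [0.0, 1.0],
}
print(top_k([1.0, 0.0], index))  # ['section-A', 'exemption-B']
```

The retrieved IDs would then be resolved to their text and placed in the prompt as context for the section under review.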


💡 Results

  • Reduced text length by ~60% in the first pass.
  • Preserved or improved readability across more than 1,000 sections.
  • Maintained revenue neutrality within acceptable bounds.
  • Established an agentic feedback loop that can continue refining the code across future iterations.

💾 Open Source

All code, prompts, and data are open-sourced and available at:

👉 github.com/candrasick/ai_tax_agent