AI-Powered Simplification of the U.S. Tax Code
Repository: github.com/candrasick/ai_tax_agent
Overview
In this capstone project, I built a suite of AI agents using Google’s Gemini 1.5 Pro to tackle one of the most complex regulatory systems in existence: the U.S. tax code. The goal was ambitious: simplify the legal text, maintain revenue neutrality, and surface low-impact, high-complexity candidates for deletion or rewrite.
The first full pass through 2,100+ sections resulted in a 60% reduction in length, compressing over 7,000 pages to fewer than 3,000. And this is just the beginning.
Video Summary
You can find a video summarizing the project here:
Key Features and Capabilities
Exemption Extraction
Gemini was used to parse each section of the U.S. Code and extract tax exemptions (often deeply embedded or cross-referenced) as discrete entities. This enabled clearer mapping of tax advantages and loopholes to statutory sections.
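Below is a minimal sketch of what such an extraction call could look like with the google-generativeai SDK. The prompt wording, the JSON schema, and the extract_exemptions helper are illustrative assumptions, not the project's actual code.

```python
import json
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # assumes an API key is available
model = genai.GenerativeModel("gemini-1.5-pro")

EXTRACTION_PROMPT = """You are a tax-law analyst. From the statute text below,
list every exemption as a JSON array of objects with keys:
"description", "eligible_entities", "cross_references".
Return only JSON.

Statute text:
{section_text}
"""

def extract_exemptions(section_text: str) -> list[dict]:
    # Ask the model for structured output, then strip optional markdown fences.
    response = model.generate_content(EXTRACTION_PROMPT.format(section_text=section_text))
    raw = response.text.strip().removeprefix("```json").removesuffix("```")
    return json.loads(raw)
```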
Integration Testing of Heuristics
Using Gemini as a cognitive test harness, I verified custom logic against form metadata and code references. This allowed for automated regression checks during agent design and helped tune the heuristics used to link sections to statistics.
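As an illustration, a regression check of this kind might pair a deterministic heuristic with a model-based verdict. The test below is a sketch: link_line_item_to_section is a hypothetical name for the project's linkage heuristic, and model is the Gemini client from the previous sketch.

```python
def test_form_1040_standard_deduction_linkage():
    # Heuristic under test: map a form line item to a code section.
    line_item = {"form": "1040", "line": "12", "label": "Standard deduction"}
    predicted = link_line_item_to_section(line_item)  # hypothetical project helper
    assert predicted == "Section 63"

    # Cognitive check: ask Gemini to confirm or reject the same linkage.
    verdict = model.generate_content(
        f"Does IRS Form {line_item['form']}, line {line_item['line']} "
        f"({line_item['label']}) correspond to IRC {predicted}? Answer YES or NO."
    )
    assert "YES" in verdict.text.upper()
```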
Parsing Unstructured PDF Statistics
Gemini’s multimodal capabilities were used to convert IRS statistics, often locked in PDF tables, into structured JSON. This enabled dynamic linkage between form line items and statutory authority, empowering more accurate impact estimation.
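A rough sketch of that conversion, using the SDK's File API to hand a PDF to the model; the prompt, output schema, and parse_statistics_pdf name are assumptions for illustration.

```python
import json
import google.generativeai as genai

model = genai.GenerativeModel("gemini-1.5-pro")

def parse_statistics_pdf(pdf_path: str) -> dict:
    # Upload the PDF so the model can read its tables directly.
    uploaded = genai.upload_file(path=pdf_path)
    prompt = (
        "Extract every table in this IRS statistics PDF as JSON keyed by table "
        "title, with each row as an object of column-name/value pairs. "
        "Return only JSON."
    )
    response = model.generate_content([uploaded, prompt])
    raw = response.text.strip().removeprefix("```json").removesuffix("```")
    return json.loads(raw)
```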
Complexity Scoring
Each section was scored based on:
- Legal language complexity (measured directly)
- Z-scores for section length
- Amendment count
- IRS bulletin references
Gemini synthesized these into a final complexity_score, which guided subsequent rewrite decisions.
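A sketch of how those signals might be handed to the model for synthesis; the prompt, the 0-100 scale, and the signal field names are assumptions, and model is the Gemini client from the earlier sketches.

```python
import json

def score_section(section_id: str, signals: dict) -> float:
    # signals, e.g. {"language_complexity": 0.8, "length_z": 2.1,
    #                "amendment_count": 37, "bulletin_refs": 12}
    prompt = (
        f"Section {section_id} has these complexity signals: {json.dumps(signals)}. "
        "Weigh legal-language complexity, length z-score, amendment count, and "
        'IRS bulletin references, then return only JSON: {"complexity_score": <0-100>}.'
    )
    response = model.generate_content(prompt)
    raw = response.text.strip().removeprefix("```json").removesuffix("```")
    return float(json.loads(raw)["complexity_score"])
```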
Impact Cataloging
The agent estimated financial and entity impact using:
- Tax statistics from the IRS
- Line item instructions
- Linked form references
This allowed the agent to identify high-impact areas of the code and prioritize retention or careful editing.
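For illustration, once the PDF statistics and form-to-section links are in structured form, the impact estimate can be a straightforward aggregation. The field names below are hypothetical placeholders, not the project's schema.

```python
def estimate_impact(section_id: str, line_item_stats: list[dict]) -> dict:
    """Sum dollar amounts and filer counts for line items linked to a section."""
    linked = [row for row in line_item_stats
              if section_id in row.get("linked_sections", [])]
    return {
        "section": section_id,
        "total_dollars": sum(row.get("amount_usd", 0) for row in linked),
        "total_returns": sum(row.get("num_returns", 0) for row in linked),
    }
```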
LLM-Grounded Decision Making
The agent made edit decisions (simplify, redraft, delete, keep) by reasoning across both complexity and impact. Sections with low complexity and low impact were often deleted. High-impact sections were simplified or redrafted conservatively.
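In the project the final call is made by the LLM rather than a fixed rule, but the shape of the decision space looks roughly like the sketch below; the thresholds are invented purely for illustration.

```python
def decide_action(complexity_score: float, impact_usd: float) -> str:
    # Illustrative thresholds only; in practice Gemini weighs both dimensions
    # plus retrieved context before choosing an action.
    if complexity_score < 20 and impact_usd < 1e8:
        return "delete"      # low complexity, low impact
    if impact_usd >= 1e10:
        return "simplify"    # high impact: edit conservatively
    if complexity_score >= 60:
        return "redraft"     # complex but moderate impact
    return "keep"
```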
ChromaDB Vector Search
ChromaDB was used to semantically retrieve related sections, exemptions, and form metadata. This gave the LLM critical context and allowed deeper insight into how a section connects across the tax ecosystem.
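A minimal retrieval sketch with the chromadb client; the collection name, metadata fields, and sample documents are placeholders rather than the project's actual data.

```python
import chromadb

client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection("tax_sections")

# Index section text plus metadata (done once during ingestion).
collection.add(
    ids=["section_63"],
    documents=["Section 63. Taxable income defined..."],
    metadatas=[{"kind": "section", "number": "63"}],
)

# At edit time, retrieve semantically related context to ground the LLM prompt.
related = collection.query(query_texts=["standard deduction exemption"], n_results=5)
print(related["documents"][0])
```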
Results
- Reduced text length by ~60% in the first pass.
- Preserved or improved readability across over 1,000 sections.
- Maintained revenue neutrality within acceptable bounds.
- Established an agentic feedback loop that can continue refining the code across future iterations.
Open Source
All code, prompts, and data are open source and available at github.com/candrasick/ai_tax_agent.