How I went from 46% to 94% accuracy by evaluating four parser architectures and approaching NLP as a system with measurable trade-offs.
Why This Matters
I've been building Trip Threads, a conversational travel planner designed to solve the coordination chaos of group trips. Natural language input quickly became the defining feature of the platform. The hypothesis: if users could type things like "Dinner €60 split 4 ways" or "Flight to Paris Monday 9am" and the system just handled the rest, the experience would be dramatically faster than traditional forms.
But implementing a parser that works across messy real-world input turned out to be one of the most interesting systems problems in the entire project.
Building a Repeatable Evaluation Framework
Before evaluating different approaches, I needed a consistent way to measure accuracy. I developed a test suite of 90 sample inputs that represented the natural language I expected users to enter:
- "Dinner at Mario's €45 split 3 ways"
- "Flight tmr 9am"
- "Hotel checkout Sunday, Alice paid $200"
- "Museum tickets 15 each for 4 people"
These covered the typical use cases: expenses, itinerary items, split logic, dates, and participant tracking. I intentionally included variations in:
- Formality: "tomorrow" vs "tmr" vs "next Thursday"
- Currency formats: "€60" vs "60 euros" vs "60EUR"
- Split expressions: "split 4 ways" vs "15 each" vs "Bob owes 40%"
- Date references: Absolute dates, relative dates, and implicit timing
To evaluate each parser, I defined an error taxonomy:
- Amount errors
- Currency errors
- Date/time errors
- Participant/split errors
- Structural interpretation errors
Every parser was run against identical inputs, and accuracy was measured using this taxonomy. This made the trade-offs much clearer than a simple pass/fail system.
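To keep the comparison repeatable, I scripted the harness so every parser ran the same suite and reported errors against the taxonomy. Here's a minimal sketch of what that can look like; the type names, fields, and helper functions below are illustrative, not the actual Trip Threads code:

```typescript
// Minimal evaluation harness sketch (illustrative names, not the production code).
type ErrorCategory = "amount" | "currency" | "datetime" | "participants" | "structure";

interface ParsedItem {
  amount?: number;
  currency?: string;
  date?: string;            // ISO date string
  participants?: string[];
  splitType?: "even" | "each" | "percentage";
}

interface TestCase {
  input: string;
  expected: ParsedItem;     // ground-truth structured result
}

function diff(expected: ParsedItem, actual: ParsedItem): ErrorCategory[] {
  const errors: ErrorCategory[] = [];
  if (expected.amount !== actual.amount) errors.push("amount");
  if (expected.currency !== actual.currency) errors.push("currency");
  if (expected.date !== actual.date) errors.push("datetime");
  if (JSON.stringify(expected.participants) !== JSON.stringify(actual.participants)) errors.push("participants");
  if (expected.splitType !== actual.splitType) errors.push("structure");
  return errors;
}

function evaluate(parser: (input: string) => ParsedItem, suite: TestCase[]) {
  const errorsByCategory: Record<ErrorCategory, number> = {
    amount: 0, currency: 0, datetime: 0, participants: 0, structure: 0,
  };
  let passed = 0;
  for (const { input, expected } of suite) {
    const errors = diff(expected, parser(input));   // classify mismatches into the taxonomy
    if (errors.length === 0) passed++;
    for (const e of errors) errorsByCategory[e]++;
  }
  return { passed, total: suite.length, errorsByCategory };
}
```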
In the future, I plan to expand this with real user inputs to ensure broader coverage of input types (formal, casual, regional variations, etc.). But for the MVP, this test suite gave me a reliable benchmark to compare different parsing approaches.
Architecture Decisions
When evaluating parsing solutions, I optimized for several constraints:
Technical Complexity: Keep it simple.
Cost: Minimize ongoing expenses, especially during development and early user testing.
Latency: Users expect their inputs to be processed instantly without breaking the conversational flow.
Development Speed: I needed to move fast and validate the concept before over-investing in infrastructure.
Input Data Constraints: With limited real-world user data, I couldn't train custom models effectively. This pushed me toward off-the-shelf solutions.
Privacy – Especially important since users share trip information.
These constraints shaped my evaluation criteria.
Version 1: Regex Parsing
Fast, simple… and not sufficient for natural language.
Regex was the obvious starting point: fast, deterministic, and easy to run on-device. Perfect for the MVP mindset.
In theory, patterns like /(\d+\.?\d*)\s*(€|\$|EUR|USD)/ should catch currency amounts. But real users don't write textbook sentences.
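To make the failure mode concrete, here's roughly what such a matcher does against the test suite (a sketch based on my assumed pattern, not the exact production regex):

```typescript
// Naive currency matcher (illustrative sketch). Fine for tidy input, wrong or empty for the rest.
const CURRENCY_RE = /(\d+\.?\d*)\s*(€|\$|EUR|USD)/;

CURRENCY_RE.exec("Lunch 45 EUR split 3 ways"); // ["45 EUR", "45", "EUR"]: the happy path
CURRENCY_RE.exec("Dinner €45 split 3 ways");   // null: the symbol comes before the number
CURRENCY_RE.exec("Dinner 60 euros");           // null: "euros" isn't in the alternation
CURRENCY_RE.exec("1.234,56 EUR");              // matches "56 EUR": wrong amount for EU formatting
```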
What Broke
- Typos: "60 euro" vs "60 euros" vs "60€"
- Order variations: "Split €60 four ways" vs "€60 split 4"
- Context-dependent phrases: "Dinner with Alice yesterday"
- Regional formats: "1.234,56 EUR" vs "$1,234.56"
I could patch individual cases, but the combinatorial explosion was endless.
Accuracy: 42/90 (~46%)
Great for textbook cases. Terrible for real-world use.
Verdict: Not good enough to be the flagship feature.
Version 2: Deterministic NLP Parser
A more sophisticated rules-based system with reinforcement learning.
I leveled up to a proper parser using:
- Tokenization and custom heuristics
- chrono-node for date extraction
- Currency normalization rules
- Split-type detection logic
- Reinforcement learning mechanism
This felt like real engineering. I built 100+ passing unit tests. The parser was fully offline, deterministic, and blazing fast (<10ms).
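A simplified view of that pipeline is sketched below. Only chrono-node's parseDate call is the real library API; the normalization table, regexes, and split heuristics are illustrative stand-ins for the actual rules:

```typescript
import * as chrono from "chrono-node";

// Simplified deterministic pipeline (illustrative rules; chrono-node is the real dependency).
const CURRENCY_WORDS: Record<string, string> = {
  "€": "EUR", "$": "USD", eur: "EUR", euro: "EUR", euros: "EUR", usd: "USD",
};

function parseDeterministic(input: string) {
  // 1. Date extraction via chrono-node ("tomorrow 9am", "next Thursday", ...)
  const date = chrono.parseDate(input) ?? undefined;

  // 2. Amount + currency normalization
  const amountMatch = input.match(/(\d+(?:[.,]\d+)?)\s*(€|\$|eur|euros?|usd)?/i);
  const amount = amountMatch ? parseFloat(amountMatch[1].replace(",", ".")) : undefined;
  const currency = amountMatch?.[2] ? CURRENCY_WORDS[amountMatch[2].toLowerCase()] : undefined;

  // 3. Split-type detection heuristics
  let splitType: "even" | "each" | undefined;
  if (/split\s+\d+\s+ways?/i.test(input)) splitType = "even";
  else if (/\b\d+\s+each\b/i.test(input)) splitType = "each";

  return { date, amount, currency, splitType };
}
```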
Reinforcement Learning Layer
I incorporated a reinforcement learning mechanism where users could provide feedback when the parser got it wrong. The workflow (a code sketch follows the list):
- User enters natural language input
- Parser generates a result
- User confirms if correct, or corrects any errors
- Corrections are scored and used to rank different parsing rules
- Next time a similar phrase appears, the highest-ranked rule is applied
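In code, the ranking loop looked roughly like this; the types, scoring increments, and function names are my simplified reconstruction rather than the exact implementation:

```typescript
// Illustrative rule-ranking feedback loop (names and scoring values are assumptions).
type ParsedItem = { amount?: number; currency?: string; date?: Date; splitType?: string };

interface ParseRule {
  id: string;
  matches: (input: string) => boolean;
  apply: (input: string) => ParsedItem;
  score: number;                                 // bumped up or down by user feedback
}

const rules: ParseRule[] = [];

function parseWithRanking(input: string): { rule: ParseRule; result: ParsedItem } | null {
  const candidates = rules
    .filter((r) => r.matches(input))
    .sort((a, b) => b.score - a.score);          // highest-ranked rule wins
  if (candidates.length === 0) return null;
  return { rule: candidates[0], result: candidates[0].apply(input) };
}

// Called when the user confirms the preview or corrects it.
function recordFeedback(rule: ParseRule, wasCorrect: boolean) {
  rule.score += wasCorrect ? 1 : -2;             // penalize wrong parses harder than right ones are rewarded
}
```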
What Worked
- Fully offline
- Extremely fast (<10ms)
- Predictable and debuggable
The idea was to catch more edge cases over time without manually coding every variation. However, this grew complex quickly:
- Insufficient data: Without enough user corrections, the scoring wasn't meaningful
- Rule explosion: The number of competing rules grew faster than the learning could optimize
- Debugging difficulty: It became hard to understand why certain rules won over others
While theoretically promising, the reinforcement learning approach needed significantly more input data to become useful.
What Still Broke
Real-world travel messages are just too varied:
- "Taxi 45 with John owes me later"
- "Marriott 15–20 Dec €200"
- "Museum tmr afternoon"
- "Brunch yesterday 30 each"
Even with reinforcement learning, complexity grew without improving overall reliability. Edge cases multiplied faster than I could address them.
Accuracy: 55/90 (~61%)
Better than regex — but still not good enough to be a flagship feature.
Verdict: Rule-based systems don't scale to real-world language. The long tail of expressions is endless, and reinforcement learning requires more data than I had available.
Version 3: On-Device LLMs
Promising accuracy, unusably slow without hardware acceleration.
To improve accuracy without sending data to external services, I tried running small LLMs locally:
- Phi-3-mini
- Mistral 7B
- Llama 3.2
These ran on a MacBook Air (CPU-only) using Ollama.
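For context, the calls went through Ollama's local HTTP API, roughly as sketched below; the prompt and model tag are illustrative, while the endpoint and request shape follow Ollama's standard generate API:

```typescript
// Rough sketch of a structured-extraction request against a local Ollama instance.
async function parseWithLocalLLM(input: string): Promise<unknown> {
  const res = await fetch("http://localhost:11434/api/generate", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "phi3:mini",                        // or mistral, llama3.2 (illustrative tags)
      prompt: `Extract the travel item as JSON with amount, currency, date, participants:\n${input}`,
      format: "json",                            // constrain output to valid JSON
      stream: false,
    }),
  });
  const data = await res.json();
  return JSON.parse(data.response);              // the model's JSON payload
}
```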
What Worked
- Better flexibility than deterministic parsing
- Solid privacy guarantees
- No server dependency
- Handled ambiguity gracefully
What Didn't
Latency: Over 30 seconds per request
For interactive UI, this was a deal-breaker. Users expect instant feedback, especially when typing conversationally. Waiting 30 seconds for a preview completely destroyed the product experience.
Accuracy: ~60–70/90
Verdict: On-device LLMs aren't suitable for my use case. Without hardware acceleration or quantization, latency is unacceptable for real-time experiences.
Version 4: Hosted LLM (GPT-4o-mini)
The first approach that actually felt "right."
I integrated GPT-4o-mini through a server-side API route with structured JSON output.
This version delivered:
- High accuracy across diverse input styles
- Robust parsing of dates, currencies, amounts, and participants
- Handling of typo-ridden or ambiguous input
- Ability to parse "dual" items (e.g., hotel = expense + stay)
- Significantly reduced maintenance burden
The key architectural decision was using a preview → confirm workflow:
User Input
↓
AI Parser (GPT-4o-mini)
↓
Structured JSON
↓
UI Preview (user can edit)
↓
User Confirms
↓
Database Save
This architecture naturally absorbs the ~1 second latency.
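A condensed sketch of the server-side route behind that flow, assuming a Next.js-style route handler and the OpenAI Node SDK (the prompt and output fields are simplified from the real schema):

```typescript
import OpenAI from "openai";

const openai = new OpenAI();                     // reads OPENAI_API_KEY from the environment

// Simplified parse route: raw text in, structured JSON out for the editable preview.
export async function POST(req: Request) {
  const { input } = await req.json();

  const completion = await openai.chat.completions.create({
    model: "gpt-4o-mini",
    response_format: { type: "json_object" },    // force a parseable JSON payload
    messages: [
      {
        role: "system",
        content:
          "Extract a travel item from the user's message as JSON with fields: " +
          "type (expense | itinerary | both), amount, currency, date, participants, split.",
      },
      { role: "user", content: input },
    ],
  });

  const parsed = JSON.parse(completion.choices[0].message.content ?? "{}");
  return Response.json(parsed);                  // rendered as an editable preview before saving
}
```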
Latency: 500–1500ms
Acceptable for a preview flow.
Accuracy: 85/90 (~94%)
A dramatic improvement over all other approaches. The parser now handled:
- Typos and colloquialisms
- Relative dates ("tomorrow", "next Thursday")
- Multi-currency expressions
- Complex split logic
- Ambiguous phrasing
Cost & Privacy Tradeoffs
Cost:
- ~$0.00005 per parse
- ~10,000 parses/month → ~$0.50
- ~100,000 parses/month → ~$5.00
For the UX benefits and accuracy gains, this was an easy decision. Even at scale, the cost remains negligible compared to the value delivered.
Privacy Considerations:
Moving to a hosted LLM meant sending user input to OpenAI's servers. To address privacy concerns:
- Data Sanitization: User input is sanitized before sending to the API, removing any PII (personally identifiable information) when possible (a sketch follows this list)
- Structured Output: The API returns only classification data and intent mapping as JSON, not conversational responses
- No Training: Data sent to OpenAI's API is not used to train their models (per their API terms)
- Preview Flow: Users see and approve the parsed result before it's saved to the database
- Reduced data transfer: Only the minimum required text is sent to the model
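As a concrete, purely illustrative example of that sanitization step, assuming simple regex-based scrubbing rather than the app's actual rules:

```typescript
// Hypothetical pre-send sanitizer: strips obvious PII before text leaves the client.
const EMAIL_RE = /[\w.+-]+@[\w-]+\.[\w.]+/g;
const PHONE_RE = /\+?\d[\d\s().-]{7,}\d/g;

function sanitizeForLLM(input: string): string {
  return input
    .replace(EMAIL_RE, "[email]")
    .replace(PHONE_RE, "[phone]")
    .trim();
}

sanitizeForLLM("Dinner €60, pay back alice@example.com"); // "Dinner €60, pay back [email]"
```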
While not as private as on-device processing, this approach balances privacy with usability. For a travel coordination app where users are already sharing information with their group, the tradeoff was acceptable.
Verdict: Accuracy was the defining feature. For Trip Threads, correctness is non-negotiable. A fast but wrong parser creates more frustration than value. The cost is negligible, and privacy concerns are mitigated through data handling practices.
The Comparison
- Version 1, Regex: 42/90 (~46%), effectively instant, too brittle for real-world input
- Version 2, Deterministic NLP parser: 55/90 (~61%), <10ms, rules and rule ranking didn't scale to the long tail
- Version 3, On-device LLMs (Ollama): ~60–70/90, 30+ seconds per request, unusable latency without hardware acceleration
- Version 4, Hosted LLM (GPT-4o-mini): 85/90 (~94%), 500–1500ms, the approach I moved forward with
What I Learned
Building this parser ended up being a full AI systems evaluation exercise — not just a feature implementation.
Key Lessons:
- Accuracy beats speed when the feature is core to the product.
- Rule-based systems don't scale to real-world language.
- On-device LLMs are compelling, but require hardware-aware deployment.
- A repeatable evaluation framework is essential for AI features.
- Think like a systems PM.
- Architectural context matters as much as model choice.
This project reinforced that AI features must be treated as systems with measurable KPIs and evolving constraints, not just as one-off model integrations.
What's Next
I'm exploring a hybrid inference strategy that balances accuracy, latency, cost, and privacy:

Future Architecture Goals:
- Custom fine-tuned LLM trained on user corrections and Trip Threads-specific patterns
- OpenAI fallback for edge cases where custom model has low confidence
- Client-side deterministic parsing for simple, high-confidence patterns (e.g., "€50")
- Result caching to improve speed and reduce cost
- Offline fallback using deterministic parsing when network unavailable
- Continuous learning from user corrections to improve the custom model
Over time, as we collect sufficient user inputs and corrections, we can train a custom model specifically for travel planning language. This would reduce dependency on external APIs, lower costs, and improve privacy — while maintaining the accuracy benefits of LLM-based parsing.
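A rough sketch of how that routing could fit together; the confidence thresholds, helper names, and cache are all design assumptions at this stage, not shipped code:

```typescript
// Hypothetical hybrid routing: deterministic first, custom model next, hosted fallback last.
type ParsedItem = { amount?: number; currency?: string; date?: string };
type Scored = { item: ParsedItem; confidence: number };

// Stubs standing in for the real parsers.
declare function tryDeterministicParse(input: string): Scored | null;   // client-side rules, e.g. "€50"
declare function customModelParse(input: string): Promise<Scored>;      // future fine-tuned model
declare function openAIParse(input: string): Promise<ParsedItem>;       // hosted fallback for edge cases

const cache = new Map<string, ParsedItem>();

async function hybridParse(input: string): Promise<ParsedItem> {
  const key = input.trim().toLowerCase();
  const cached = cache.get(key);
  if (cached) return cached;                                             // result caching

  // 1. Cheap deterministic pass for simple, high-confidence patterns
  const simple = tryDeterministicParse(input);
  if (simple && simple.confidence > 0.9) return remember(key, simple.item);

  // 2. Custom model first; hosted fallback when its confidence is low
  const custom = await customModelParse(input);
  const result = custom.confidence > 0.8 ? custom.item : await openAIParse(input);
  return remember(key, result);
}

function remember(key: string, item: ParsedItem): ParsedItem {
  cache.set(key, item);
  return item;
}
```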
Closing Thoughts
Trip Threads started as a way to scratch my own itch around group travel planning. But the NLP parser became the most interesting part of the journey.
It forced me to think deeply about product trade-offs: accuracy vs. speed, privacy vs. capability, simplicity vs. robustness.
And it reminded me that the best solution isn't always the most technically elegant one — it's the one that delivers the best user experience within real-world constraints.
If you want to go deeper into the implementation, test suite, or architecture, feel free to reach out.
This post is part of my ongoing work on Trip Threads — a conversational travel planner built to make group coordination feel more human.