Date: 2025-11-17

Data curation

Data is perhaps the fundamental building block of an AI application — particularly one that’s customer-facing. However, although the core set of data was relatively straightforward — information about the conference and schedule, for instance — there was still significant curation required to create an assistant that met our initial requirements.

One important step was anticipating attendees' questions so we could build an FAQ feature for the chatbot. To do this, we built up a dataset of common questions sourced from attendees at other XConf events around the world. We then tailored the data, including the relevant responses, to ensure it was appropriate and relevant to our Madrid audience.

Data minimization

Part of what helped us complete the project so quickly was remaining true to the principle of data minimization, that is, collecting and processing as little data as possible. For instance, no usage analytics were collected at all. While this obviously limited what we could learn about user behavior, it nevertheless meant we could do two things: first, focus on the experience of the application when launched for the event, and second, ensure full compliance with data privacy regulations (in this case GDPR).

This principle also led us to specifically avoid collecting private or personal information. For example, we:

Instructed users explicitly not to use personal information (i.e., to use nicknames instead).

Decided not to store any conversations on our servers, only on the user’s device.

Gave users the option to clear all locally stored data.
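The last two points can be illustrated with a minimal sketch of a device-local conversation store that offers a "clear all" action. This is not the project's actual implementation: the class name `LocalConversationStore`, the JSON file layout, and the use of a temporary directory as a stand-in for on-device storage are all assumptions made for illustration.

```python
import json
import tempfile
from pathlib import Path


class LocalConversationStore:
    """Hypothetical store that keeps the conversation in one local JSON file.

    Nothing is sent to a server; the file stands in for on-device storage.
    """

    def __init__(self, path):
        self.path = Path(path)

    def load(self):
        # An absent file simply means there is no stored conversation.
        if not self.path.exists():
            return []
        return json.loads(self.path.read_text(encoding="utf-8"))

    def append(self, role, text):
        # Read, extend, and rewrite the whole file; fine for small histories.
        messages = self.load()
        messages.append({"role": role, "text": text})
        self.path.write_text(json.dumps(messages), encoding="utf-8")

    def clear(self):
        # "Clear all locally stored data": delete the file if it exists.
        self.path.unlink(missing_ok=True)  # requires Python 3.8+


# Usage with a temporary directory standing in for the user's device.
store = LocalConversationStore(Path(tempfile.mkdtemp()) / "conversation.json")
store.append("user", "placeholder question")
```

Because the only copy of the conversation lives in that one file, calling `store.clear()` is enough to remove everything the application has stored.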

Helping the LLM retrieve data effectively

If data was the foundation of the chatbot, the LLM and its ability to retrieve data were crucial in ensuring the interaction layer of the application was effective and, to return to our goal, actually improved attendees' experience of the event. A clunky chatbot, after all, would be worse than no chatbot at all.

We used an older model for this project: Claude 3.5 Sonnet. This was admittedly a conservative choice, but we made it because of the time constraint: using something more ‘proven’ meant we could move faster as a team. We also used retrieval-augmented generation (RAG) as a retrieval technique because we believed it would improve the performance of the application. On reflection, however, this was probably unnecessary given that the project didn’t involve a huge amount of data.

In fact, as the project progressed, our use of RAG declined in scope. Originally, we thought we needed to use external RAG libraries and set up a RAG database, either a custom one or one on Amazon RDS for PostgreSQL. However, because the dataset was so small, we handled embedding generation ourselves and saved the embeddings to JSON files on S3. Simplifying our solution reduced our infrastructure footprint, which was great news for both our budget and delivery timeline.
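To show why a small dataset does not need a vector database, here is a minimal sketch of retrieval over embeddings held in a JSON document. It is an illustration under stated assumptions, not the code we shipped: the record layout (`text` and `embedding` keys), the function names, the tiny three-dimensional vectors, and the inline JSON string standing in for a file downloaded from S3 are all hypothetical.

```python
import json
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def top_k(query_embedding, records, k=3):
    """Return the k records most similar to the query, best match first."""
    scored = [(cosine_similarity(query_embedding, r["embedding"]), r) for r in records]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [record for _, record in scored[:k]]


# Placeholder payload standing in for a JSON file fetched from S3.
# Real embeddings would have hundreds or thousands of dimensions.
EXAMPLE_JSON = """
[
  {"text": "chunk A", "embedding": [1.0, 0.0, 0.0]},
  {"text": "chunk B", "embedding": [0.0, 1.0, 0.0]},
  {"text": "chunk C", "embedding": [0.9, 0.1, 0.0]}
]
"""

records = json.loads(EXAMPLE_JSON)
# The query vector would come from the same embedding model as the records.
results = top_k([1.0, 0.0, 0.0], records, k=2)
```

A linear scan like this compares the query against every stored vector, which is perfectly adequate when the whole dataset fits comfortably in memory. The retrieved `text` values would then be added to the prompt sent to the LLM.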