
Agile AI at Georgian – Part 3: Experimentation and Effort Allocation

Welcome back to Agile AI at Georgian, where I share lessons learned about how to adapt agile methodologies for AI products. In previous installments, we’ve talked about finding your project’s North Star and motivating your team. Today, I want to talk about one of the areas that often trips up teams when trying to do agile for AI: experimentation and effort allocation.

AI product development is very close to academic research in many ways. This means that you will be overseeing experiments, kind of like a principal investigator. The process will be iterative, fraught with failure — and extremely rewarding! So how do you take such an uncertain process and make it work with agile?

Get your hands dirty

With so many different AI frameworks on the market, it can be tempting to embrace off-the-shelf algorithms and packages. But it’s important to make sure your initial experiments are involved enough that you really get into the data yourself as well. This will pay dividends by giving your team a deeper understanding of the problem space, better hypotheses and the ability to make more accurate estimations.

At Georgian, we often try first to develop a high-quality, domain-tuned model. In this phase, we focus on getting high-precision results for a subset of the problem and don’t shy away from manual, technically simple solutions like regular expressions or rules based on lists of words. We can then use that performance as a baseline when we scale up our efforts and push for higher recall.
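
To make that concrete, here is a minimal sketch of what a rules-based, high-precision baseline might look like. The tags and keyword patterns are hypothetical placeholders, not rules from an actual Georgian project; the point is simply that a list of expert-supplied phrases gives you a precision-oriented baseline to measure learned models against.

```python
import re

# Hypothetical keyword rules an in-house expert might supply for a
# high-precision subset of a market-tagging problem.
KEYWORD_RULES = {
    "cybersecurity": [r"\bthreat detection\b", r"\bendpoint protection\b"],
    "fintech": [r"\bpayments?\b", r"\blending\b", r"\bdigital bank(ing)?\b"],
}

def rule_based_tags(description: str) -> set:
    """Return the set of market tags whose patterns match the description."""
    tags = set()
    for tag, patterns in KEYWORD_RULES.items():
        if any(re.search(p, description, flags=re.IGNORECASE) for p in patterns):
            tags.add(tag)
    return tags

# Rules like these tend to be high precision (they only fire on explicit
# phrases) but low recall -- exactly the baseline we want to beat when we
# later scale up with learned models.
print(rule_based_tags("A digital banking platform for small-business lending"))
# {'fintech'}
```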

INFOGRAPHIC #1: Get your hands dirty with data and experiments

Use an experimental design framework

I strongly recommend using a formal experimental design framework that supports the full scientific lifecycle. This will help set some guardrails, get the team moving in the same direction, and help data scientists communicate their findings. There are many experimental design frameworks you can choose from so that you don’t have to start from scratch — here’s one example I’ve found particularly useful. 

Think of your framework like a contract between the product owner and the data scientist that spells out specifically which assumptions are being tested, with clear metrics and performance thresholds. It should include acceptance criteria as well as stopping criteria and define the actions that will be taken based on the results of the experiments. You’ll need to make sure your PM is ready to proactively prune experiments and eliminate avenues of research that have a low probability of bearing fruit. 

At Georgian, we try to map out two to three experiments in each key area ahead of time, and define clear go/no-go criteria, a decision tree of sorts with logic that everyone agrees to up front. We also frame each experiment as yielding either concrete learnings or answers to specific questions. That way, regardless of the outcome, each question answered feels like progress in understanding the space. 
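
To illustrate what such a contract can look like in practice, here is a minimal sketch of an experiment spec with acceptance and stopping criteria and agreed-upon go/no-go logic. The field names, thresholds and actions are illustrative assumptions, not a prescribed Georgian template.

```python
from dataclasses import dataclass

@dataclass
class ExperimentSpec:
    """A lightweight 'contract' between the product owner and the data scientist."""
    hypothesis: str           # the assumption being tested
    metric: str               # the metric everyone agreed on up front
    accept_threshold: float   # acceptance criterion: go if metric >= this
    stop_threshold: float     # stopping criterion: prune if metric < this
    max_sprints: int = 1      # timebox so low-probability avenues get pruned
    on_go: str = "promote to the next round of experiments"
    on_no_go: str = "document learnings and prune this avenue"

    def decide(self, observed: float) -> str:
        # The go/no-go logic everyone signs off on before the experiment runs.
        if observed >= self.accept_threshold:
            return f"GO: {self.on_go}"
        if observed < self.stop_threshold:
            return f"NO-GO: {self.on_no_go}"
        return "INCONCLUSIVE: revisit the design or extend by one timebox"

spec = ExperimentSpec(
    hypothesis="Zero-shot tagging reaches useful precision on a subset of markets",
    metric="precision",
    accept_threshold=0.80,
    stop_threshold=0.50,
)
print(spec.decide(observed=0.72))  # INCONCLUSIVE: revisit the design ...
```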

For instance, we are building a market tagging system to automatically contextualize companies. We first gathered ideas from the whole team and identified several possible pathways, such as building our own classifiers with the help of in-house experts, unsupervised clustering, using third-party tags and zero-shot learning. All of our scientists came together for a hackathon to quickly test each idea. There, we discovered that, if we were to build our own classifiers, a single multi-label classifier would perform better than individually trained binary classifiers. We also learned that every avenue produced promising output, so we set out to investigate how best to combine the benefits of each approach.
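
For readers who want to see the shape of that comparison, here is a hedged sketch in scikit-learn on synthetic data. It is not the model we actually built; it only illustrates the two setups being compared, individually trained binary classifiers versus a single multi-label model, and which one wins will depend on your data.

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for "company descriptions -> market tags"; the real
# features and labels in our project were of course very different.
X, Y = make_multilabel_classification(n_samples=2000, n_features=50,
                                      n_classes=5, random_state=0)
X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, random_state=0)

# Option A: individually trained binary classifiers, one per tag.
per_label_preds = np.column_stack([
    LogisticRegression(max_iter=1000).fit(X_tr, Y_tr[:, j]).predict(X_te)
    for j in range(Y.shape[1])
])

# Option B: a single multi-label model that shares a representation across tags.
joint = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0)
joint.fit(X_tr, Y_tr)
joint_preds = joint.predict(X_te)

print("per-label binary F1:", f1_score(Y_te, per_label_preds, average="micro"))
print("single multi-label F1:", f1_score(Y_te, joint_preds, average="micro"))
```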

When planning your sprints, be sure to allocate plenty of time for experimental design. It may take the better part of a sprint to do it well. Designing a good experiment requires probing datasets, allowing time for test runs and confirming the availability of infrastructure.

INFOGRAPHIC #2: Use an experimental design framework

Get good at effort estimation

Using an agreed-upon scientific process, getting people comfortable with that process, and understanding how your particular team has adopted it will make things more predictable: you’ll know what steps come next, spot early warning signals sooner and anticipate data needs.

I call this finding your rhythm. 

The hardest part of this for most teams is accurately estimating the effort required for experiments, so that sprints can run smoothly and predictably and so that experiments can be properly prioritized by effort. Some of this is to be expected — after all, science is messy and includes tons of unknowns. But we have found that it’s possible to get better and better at estimating and timeboxing efforts. Here are some of the ways we do so at Georgian:

  • Prioritize experiments up front based on level of effort, feasibility, predictability of a particular outcome and data requirements. While this can be challenging to estimate accurately, we iterate on each estimate as more data, outcomes and information are gathered through experimentation. The more we flex this important muscle, the more accurate we get. 
  • Modularize the work and develop “t-shirt sizing” skills for experiments.
  • Start with frequent check-ins and pre-retrospectives, then adjust the cadence over time. 
  • Find the right format for reporting and information-sharing that’s informative, but lightweight enough that it doesn’t become a chore that distracts from the work. Leveraging the right tooling here can help your team stay lean and support transparent, asynchronous work. For example, at Georgian, we’ve used tools like Comet and Weights & Biases (wandb), which let data scientists and all relevant stakeholders consistently and efficiently track, compare, explain and visualize experiments across the model’s entire lifecycle (see the sketch after this list).
  • Our intuition tells us that optimizing our algorithm will take far and away the greatest amount of time, but as experienced AI PMs will tell you, that’s often not the case. (This is sometimes referred to as “the machine learning surprise.”) Instead, dealing with the data and infrastructure necessary to support the algorithm, and then integrating the results into your product suite often require more time than expected. Remembering to add some extra time for these components has helped our sprint planning become more accurate.
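
As an example of the lightweight, asynchronous reporting mentioned in the list above, here is a minimal sketch of logging an experiment with Weights & Biases; the project name, config and metric values are placeholders, and Comet offers a very similar workflow.

```python
import wandb  # pip install wandb

# Hypothetical project and config, shown only to illustrate the pattern.
run = wandb.init(
    project="market-tagging-experiments",
    config={"model": "multi_label_mlp", "hidden_size": 64},
)

for epoch in range(5):
    # In a real experiment these values would come from training/validation code.
    wandb.log({
        "epoch": epoch,
        "val_precision": 0.70 + 0.02 * epoch,
        "val_recall": 0.40 + 0.05 * epoch,
    })

run.finish()  # stakeholders can now compare this run with others in the UI
```
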
INFOGRAPHIC #3: Finding your rhythm in sprints

Get your team out of a rut

Sometimes, despite your best efforts and carefully planned experiments, you’ll reach a dead end. At Georgian, for instance, when we were working on our market taxonomy, we found ourselves stymied by a challenging problem. The team was stuck and morale started to suffer. In situations like this, we have often had success temporarily suspending our normal processes and going into hackathon mode. 

Hackathons offer several benefits. First, you have multiple team members working in a highly focused, collaborative environment without time wasted on day-to-day task switching. Hackathons have also helped us very quickly prune away certain development/experimental pathways to save time. 

The more diverse the team you can assemble for your hackathon and the more ideas and techniques you can try, the more likely you are to land on an unexpected but effective solution. For our hackathon we combined our applied research team with application team data scientists, which allowed us to leverage cutting-edge research and academic trends in our solutions. Our engineering team also participated, which allowed us to use in-house MLOps tools, such as Hydra, to run and manage many more experiments far more efficiently. Hydra makes it possible for our data scientists to do large-scale, cloud-agnostic machine learning experimentation, and we recently open-sourced it so that companies can take code from their local machines and run it on much larger and more powerful cloud machines through a simple configuration file.

We also bring in team members from outside our usual pod — individuals who have fresh eyes and are new to the problem and context. 

If you go the hackathon route, make sure to leave some time to get the hackathon work into production. 

INFOGRAPHIC #4: Stuck in a rut? Consider hackathons

This is the third in a series on agile AI. If you would like to receive the rest in your inbox, sign up for our newsletter here.

Next time in our Agile AI series, I’ll talk about the data lifecycle.

Read the other installments: Part 1: Finding Your Project’s North Star, Part 2: Nurturing Your AI Team and Part 4: How Data Can Make or Break Your AI Project

This article was originally published on Georgian’s Medium.
