Jekyll2023-09-29T23:04:04+00:00https://mwburke.github.io/atom.xmlMatthew Burke’s BlogData science, generative art and other stuffMatthew Burkematthew.wesley.burke@gmail.comDistribution Drift Tolerance2023-09-28T00:00:00+00:002023-09-28T00:00:00+00:00https://mwburke.github.io/data%20science/2023/09/28/distribution-drift-tolerance<h2 id="does-psi-have-reliable-thresholds">Does PSI Have Reliable Thresholds?</h2>
<p>I’ve written before about the use of <a href="/data%20science/2018/04/29/population-stability-index.html">PSI (population stability index)</a> to measure population drift and provided some guidelines on interpreting scores, taken from <a href="https://www.lexjansen.com/wuss/2017/47_Final_Paper_PDF.pdf">this resource</a>. I recently began a project involving model monitoring and started to question how the authors of that paper derived their thresholds for slight, minor and significant changes; so far I have found no answers. It seems like a huge gap that an entire industry with science in its name is following a standard that has no formal basis.</p>
<p>To remedy this, I have created a <a href="https://github.com/mwburke/drift-tolerance/blob/main/drift_tolerance.ipynb">jupyter notebook</a> that allows the user to run through concrete examples of comparing distributions, evaluating them for drift, and creating a decision boundary representing the summary of their preferences.</p>
<h2 id="data-generation">Data Generation</h2>
<p>In order to provide distributions to compare, I chose a skewed normal as a familiar but not quite standard distribution that DS/ML practitioners may encounter in their jobs. I first generate random parameters for the skew, center and scale of the base distribution, and then modify each of those parameters by a random value within ±25% of the original. This gives us a base distribution and an alternate distribution, which we treat as the original population and the same population later in time.</p>
<p>We then sample each distribution a number of times to create our final histograms that will be reviewed by the user. For each set of samples, we can calculate the true PSI value and store it until the user has time to evaluate.</p>
<h2 id="evaluation-loop">Evaluation Loop</h2>
<p>When using the notebook, a single cell will show the plotted distributions against one another, but the true PSI value is not shown.</p>
<p><img src="/images/drift_distributions.PNG" alt="" /></p>
<p>The user must then click on one of two buttons to label the two populations as having acceptable or unacceptable drift. After clicking, their evaluation and corresponding PSI value are logged to a list, and the process repeats by replacing the plot with two new distributions and their hidden PSI value.</p>
<p>The user can continue this process as many times as they like until they feel satisfied that they have provided enough sample data to work with.</p>
<h2 id="preference-estimation">Preference Estimation</h2>
<p>What we have done here is create a labeled dataset that encodes the user’s intuition about how much difference between two distributions constitutes an unacceptable risk in terms of population drift. We can then fit a logistic regression model to quantitatively measure that relationship. I chose PyMC to build the model because of the small data sizes and noisiness in measurements (more below), but you could use anything. Once this model has been learned, we can plot our observations alongside the curve from our learned beta parameter to compare our expectations about risk to the measured reality. You can take a look at my results in the following chart:</p>
<p><img src="/images/drift_logistic.PNG" alt="" /></p>
<p>As we can see, this doesn’t look like your usual S curve. If we zoomed out significantly, we would see the shape, but as is, due to overlap between our labels, our model doesn’t have the nice, clear-cut decision boundary we were hoping for.</p>
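<p>If you don’t want the full PyMC treatment, here’s a dependency-light sketch of the same idea with scikit-learn (the PSI values and labels below are made up for illustration):</p>

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical logged session: observed PSI values and the user's labels,
# where 1 = "unacceptable drift" (these numbers are made up)
psi_vals = np.array([0.02, 0.05, 0.08, 0.10, 0.12, 0.15, 0.20, 0.25, 0.30, 0.40])
labels = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1])

clf = LogisticRegression().fit(psi_vals.reshape(-1, 1), labels)

# The 50% decision boundary, i.e. the PSI where beta0 + beta1 * psi = 0,
# is one way to read off a single-number personal drift tolerance
threshold = -clf.intercept_[0] / clf.coef_[0][0]
print(round(threshold, 3))
```

<p>The overlap between the labels is exactly what flattens the fitted slope, so the 50% crossing becomes the most useful single-number summary of your tolerance.</p>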
<h2 id="conclusions">Conclusions</h2>
<ol>
<li>Drift tolerance is subjective from person to person, and may even vary within an individual. Humans are bad at probability and at most visual mathematical intuition beyond a few data points.</li>
<li>Reliance on industry benchmarks without appropriate supporting data can leave you and your organization vulnerable to hidden mistakes and overconfidence.</li>
<li>Folks working in the data industry need to challenge assumptions, and if supporting datasets do not exist, seek them out or curate them themselves.</li>
</ol>
<p>I can imagine that going through this exercise with key members of your model validation/risk management leadership may prove insightful, and if you want to get even fancier, it might be fun to build a hierarchical logistic model to capture the risk preference of the organization overall as well as being able to compare individuals’ preferences.</p>
<h2 id="resources">Resources</h2>
<ul>
<li><a href="https://github.com/mwburke/drift-tolerance">Drift Tolerance Repo</a></li>
</ul>MatthewIdentifying your personal preference for measuring population driftPyMC Wrapper2023-02-23T00:00:00+00:002023-02-23T00:00:00+00:00https://mwburke.github.io/data%20science/2023/02/23/pymc-wrapper<h2 id="overview">Overview</h2>
<p>Bayesian modeling can be super valuable for capturing uncertainty and leveraging prior distributions for new products/geos/etc. I’m relatively new to the space, but in my role as a machine learning engineer, I found the tools to be very focused on the science and less so on deployment. While the science part is absolutely critical and must be handled thoughtfully and methodically, there are different sets of concerns when managing a suite of models in production.</p>
<p>I built a quick POC python library called <a href="https://github.com/mwburke/pymc-wrapper">pymc-wrapper</a> to align PyMC’s amazing Bayesian modeling capabilities with the ease of use of scikit-learn’s fit/predict paradigm. There certainly is more work to do to build out a robust system, but I think for simpler modeling projects that require loading, saving and prediction across a number of models, this package could simplify workflows.</p>
<h3 id="assumptions">Assumptions</h3>
<p>Because the goal wasn’t to capture all cases and mostly to prove out the idea, I incorporated several assumptions:</p>
<ol>
<li>There is a single function that relates all independent variables to our dependent variable</li>
<li>The errors are normally distributed</li>
<li>The model is non-hierarchical</li>
<li>All data preprocessing is handled beforehand</li>
</ol>
<h2 id="configuration-driven">Configuration Driven</h2>
<p>The main idea behind the package is to define your independent variables and PyMC distributions beforehand in a config file and leave the model generation and sampling hidden within the other functions. The configuration would be defined as follows:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">independent_vars</code>: a list of the names of each independent variable</li>
<li><code class="language-plaintext highlighter-rouge">sample_params</code>: a dictionary of parameters to be passed into the pm.sample function</li>
<li><code class="language-plaintext highlighter-rouge">variable_params</code>: a dictionary of PyMC variables to be used in the model
<ul>
<li><code class="language-plaintext highlighter-rouge">variable_dict</code>: a dictionary defining a single variable, with its name as the key under <code class="language-plaintext highlighter-rouge">variable_params</code>
<ul>
<li><code class="language-plaintext highlighter-rouge">dist</code>: str of the name of the PyMC distribution (case sensitive) to be used for the variable</li>
<li><code class="language-plaintext highlighter-rouge">params</code>: a dictionary of the parameters for the variable and the values to be used as
the model’s priors</li>
</ul>
</li>
</ul>
</li>
<li><code class="language-plaintext highlighter-rouge">function_params</code>: a dictionary of function parameters
<ul>
<li><code class="language-plaintext highlighter-rouge">function</code>: a python function defining the relationship of independent variables
to the dependent variable</li>
</ul>
</li>
</ul>
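<p>To make that concrete, here’s a hypothetical sketch (not the package’s actual code) of what a loaded config might look like in memory, plus a minimal structural check before model construction; the <code class="language-plaintext highlighter-rouge">dist</code> strings would then be resolved to PyMC distribution classes by name, e.g. via <code class="language-plaintext highlighter-rouge">getattr</code>:</p>

```python
# A hypothetical in-memory version of such a config (the real one would be
# loaded from YAML with something like yaml.safe_load)
config = {
    "independent_vars": ["week"],
    "variable_params": {
        "lambda_val": {"dist": "HalfNormal", "params": {"sigma": 1}},
        "intercept": {"dist": "Normal", "params": {"mu": 0, "sigma": 1}},
    },
    "sample_params": {"draws": 1000},
}

def validate_config(config):
    # Minimal structural checks before handing the config to model construction
    assert config["independent_vars"], "need at least one independent variable"
    for name, spec in config["variable_params"].items():
        assert "dist" in spec and "params" in spec, f"{name} missing dist/params"
    return True

validate_config(config)
```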
<h3 id="example-configuration">Example Configuration</h3>
<p>Here is an example config file stored in YAML that could be used to generate a model that attempts to learn product adoption rates according to the negative exponential function.</p>
<p>The independent variable would be the <code class="language-plaintext highlighter-rouge">week</code>, and we use both an <code class="language-plaintext highlighter-rouge">intercept</code> and a <code class="language-plaintext highlighter-rouge">lambda_val</code> parameter (the latter chosen to not clash with the reserved <code class="language-plaintext highlighter-rouge">lambda</code> keyword in python) to generate our final outcome.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">negative_exponential</span><span class="p">(</span><span class="n">week</span><span class="p">,</span> <span class="n">lambda_val</span><span class="p">,</span> <span class="n">intercept</span><span class="p">):</span>
<span class="k">return</span> <span class="mi">1</span> <span class="o">-</span> <span class="n">intercept</span> <span class="o">-</span> <span class="p">(</span><span class="mi">1</span> <span class="o">-</span> <span class="n">intercept</span><span class="p">)</span> <span class="o">*</span> <span class="n">np</span><span class="p">.</span><span class="n">exp</span><span class="p">(</span><span class="o">-</span><span class="n">lambda_val</span> <span class="o">*</span> <span class="n">week</span><span class="p">)</span>
</code></pre></div></div>
<p>The configuration reflects this by naming our independent variable and defining our parameters with corresponding <code class="language-plaintext highlighter-rouge">pymc.distribution</code> function names and prior values. We can also set our sampling parameters as well.</p>
<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">independent_vars</span><span class="pi">:</span>
<span class="pi">-</span> <span class="s">week</span>
<span class="na">variable_params</span><span class="pi">:</span>
<span class="na">lambda_val</span><span class="pi">:</span>
<span class="na">params</span><span class="pi">:</span>
<span class="na">sigma</span><span class="pi">:</span> <span class="m">1</span>
<span class="na">dist</span><span class="pi">:</span> <span class="s">HalfNormal</span>
<span class="na">intercept</span><span class="pi">:</span>
<span class="na">params</span><span class="pi">:</span>
<span class="na">mu</span><span class="pi">:</span> <span class="m">0</span>
<span class="na">sigma</span><span class="pi">:</span> <span class="m">1</span>
<span class="na">dist</span><span class="pi">:</span> <span class="s">Normal</span>
<span class="na">sample_params</span><span class="pi">:</span>
<span class="na">draws</span><span class="pi">:</span> <span class="m">1000</span>
</code></pre></div></div>
<p>To simplify things, I kept the function definition out of the config and manually set it afterwards with</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">config</span><span class="p">[</span><span class="s">'function_params'</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span>
<span class="s">'function'</span><span class="p">:</span> <span class="n">negative_exponential</span>
<span class="p">}</span>
</code></pre></div></div>
<h2 id="model-usage">Model Usage</h2>
<h3 id="creation-and-training">Creation and Training</h3>
<p>Once we define and load our configuration from a file into a dictionary, creating a model wrapper object and training is as easy as the following:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">model</span> <span class="o">=</span> <span class="n">PymcModel</span><span class="p">(</span><span class="n">config</span><span class="p">)</span>
<span class="n">X</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="n">independent_vars</span><span class="p">]</span>
<span class="n">Y</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="n">dependent_var</span><span class="p">]</span>
<span class="n">model</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">Y</span><span class="p">)</span>
</code></pre></div></div>
<h3 id="prediction">Prediction</h3>
<p>Prediction is easy as well, and similar to above, all we have to do is prepare our test data and pass it to the <code class="language-plaintext highlighter-rouge">predict</code> function as we would for many other model libraries. Below is an example of training on a set of curves and predicting on the same timeline to ensure that we have learned the relationship correctly:</p>
<p><img src="/images/pymc_wrapper_learned_comparison.png" alt="" /></p>
<p>Because (IMO) the point of using Bayesian models is to take advantage of their ability to capture uncertainty, I augmented the <code class="language-plaintext highlighter-rouge">predict</code> function to take in an <code class="language-plaintext highlighter-rouge">alpha</code> parameter, which is used to generate credible intervals that are output alongside the median predictions. Below is an example using an <code class="language-plaintext highlighter-rouge">alpha</code> of 0.9 with the same dataset as before:</p>
<p><img src="/images/pymc_wrapper_credible_interval.png" alt="" /></p>
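<p>Under the hood, an equal-tailed credible interval like this can be read straight off the posterior predictive draws; here’s a minimal numpy sketch (the array shapes and names are illustrative, not the package’s actual internals):</p>

```python
import numpy as np

def credible_interval(posterior_draws, alpha=0.9):
    # posterior_draws: (n_draws, n_points) array of predicted curves,
    # one row per posterior sample
    lower_q = (1 - alpha) / 2
    lower = np.quantile(posterior_draws, lower_q, axis=0)
    median = np.quantile(posterior_draws, 0.5, axis=0)
    upper = np.quantile(posterior_draws, 1 - lower_q, axis=0)
    return lower, median, upper

# Fake posterior draws standing in for real sampler output
draws = np.random.default_rng(0).normal(loc=0.5, scale=0.1, size=(4000, 10))
lo, med, hi = credible_interval(draws, alpha=0.9)
```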
<p>It’s as easy as that!</p>
<h3 id="savingloading">Saving/Loading</h3>
<p>Like other ML libraries, we often want to save our model for prediction at another time, and I have implemented <code class="language-plaintext highlighter-rouge">save_trace</code> and <code class="language-plaintext highlighter-rouge">load_trace</code> functions accordingly to facilitate this. As long as we reference the same config file for model creation, we can load a saved trace and get straight to prediction.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">trace_file_path</span> <span class="o">=</span> <span class="s">'trace.pkl'</span>
<span class="n">model</span><span class="p">.</span><span class="n">save_trace</span><span class="p">(</span><span class="n">trace_file_path</span><span class="p">)</span>
<span class="n">new_model</span> <span class="o">=</span> <span class="n">PymcModel</span><span class="p">(</span><span class="n">config</span><span class="p">)</span>
<span class="n">new_model</span><span class="p">.</span><span class="n">load_trace</span><span class="p">(</span><span class="n">trace_file_path</span><span class="p">)</span>
<span class="n">new_model</span><span class="p">.</span><span class="n">predict</span><span class="p">(</span><span class="n">X_test</span><span class="p">)</span>
</code></pre></div></div>
<h3 id="creating-config-with-posteriors-as-priors">Creating Config with Posteriors as Priors</h3>
<p>One potential use case I’ve found interesting is to learn from an established product/geo and apply the posteriors as priors for a new area. Rather than manually updating configs, I added an <code class="language-plaintext highlighter-rouge">export_trained_config</code> function that updates the priors in the original config with the posterior means and saves it to an output file. Similar to above, we could consume this updated config file to create a new model object.</p>
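<p>The core of such an export is just a config transformation. Here’s a hypothetical sketch of the idea (the helper body and the posterior summary values are made up for illustration):</p>

```python
import copy

# Hypothetical original config and posterior summary (e.g. trace means)
config = {
    "variable_params": {
        "lambda_val": {"dist": "HalfNormal", "params": {"sigma": 1}},
        "intercept": {"dist": "Normal", "params": {"mu": 0, "sigma": 1}},
    }
}
posterior_summary = {"lambda_val": {"sigma": 0.4},
                     "intercept": {"mu": 0.05, "sigma": 0.2}}

def export_trained_config(config, posterior_summary):
    # Overwrite each variable's prior params with the learned posterior
    # values, leaving the original config untouched
    updated = copy.deepcopy(config)
    for name, params in posterior_summary.items():
        updated["variable_params"][name]["params"].update(params)
    return updated

new_config = export_trained_config(config, posterior_summary)
```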
<h3 id="periodic-retraining">Periodic Retraining</h3>
<p>One last use case I thought might be interesting to showcase is the ability to call the <code class="language-plaintext highlighter-rouge">fit</code> function on multiple sets of data. In the following example, I fitted a model to the adoption rate curves we saw above, exported the trained config file, created a new model with it as priors, and finally fit that model repeatedly on data in 4-week chunks to see how the predicted curve developed over time. The performance at the beginning isn’t great, as it overpredicts the curve’s steepness at first, but given the noise early on, that’s not totally unreasonable. Either way, it definitely gets more confident over time as it gains more observed data, and it stabilizes after a few months.</p>
<p><img src="/images/pymc_wrapper_monthly_update.gif" alt="" /></p>
<h2 id="feedback">Feedback</h2>
<p>While this is definitely a work in progress, I would love for you to check out <a href="https://github.com/mwburke/pymc-wrapper/tree/main">the repo</a> and the <a href="https://github.com/mwburke/pymc-wrapper/blob/main/example/example_walkthrough.ipynb">example walkthrough notebook</a> that I used to generate these plots and examples.</p>
<p>If you have any feedback on things you would change or add, or even thoughts on whether something like this is needed at all, I would love to hear from you at <a href="mailto:matthew.wesley.burke@gmail.com">my email</a> or on github in issues/PRs!</p>Matthew Burkematthew.wesley.burke@gmail.comMaking PyMC models more accessible and reusableMMM: Miracle of Marketing Measurement or Misleading Modeling Methodology?2023-01-31T00:00:00+00:002023-01-31T00:00:00+00:00https://mwburke.github.io/data%20science/2023/01/31/mmm-future-or-liability<h1 id="mmm-miracle-of-marketing-measurement-or-misleading--modeling-methodology">MMM: Miracle of Marketing Measurement or Misleading Modeling Methodology?</h1>
<h2 id="tldr">TL;DR</h2>
<p>MMMs are a future-proofed tool for measuring marketing effectiveness in a world of increased online privacy, but they can be prone to multiple silent failure modes that produce inaccurate results.</p>
<h2 id="what-is-an-mmm">What is an MMM?</h2>
<p>Media mix modeling is a statistical technique used to understand the effectiveness of different advertising and marketing channels, such as television, print, digital, and so on. The goal of media mix modeling is to determine the optimal allocation of a company’s advertising budget across different channels to maximize the return on investment (ROI). This is done by analyzing historical data on marketing expenditures and conversions, and using statistical models to estimate the incremental impact of different advertising channels on conversions.</p>
<p>Businesses use media mix modeling to gain insights into the effectiveness of their advertising campaigns and to make data-driven decisions about where to allocate their advertising budget. By understanding the ROI of different advertising channels, companies can optimize their marketing strategy to maximize their sales and revenue. Additionally, media mix modeling can help businesses identify underperforming channels and adjust their advertising strategy accordingly. It can also help them understand changes in the market and the effectiveness of their strategy.</p>
<h2 id="privacy-proof">Privacy Proof?</h2>
<p>If you haven’t noticed the incessant cookie tracking messages on every website, the world is trending toward increased online data privacy due to legislation such as <a href="https://gdpr-info.eu/">GDPR</a> and <a href="https://oag.ca.gov/privacy/ccpa">CCPA</a>, as well as Apple’s <a href="https://developer.apple.com/app-store/user-privacy-and-data-use/">iOS 14</a> privacy changes. What this means for advertisers is that they will receive less and less impression-level user data. While your company may still be able to extract aggregate-level insights from <a href="https://www.appsflyer.com/resources/guides/data-clean-rooms/">data clean rooms</a>, mapping conversions to specific users will no longer be possible. The most obvious impact is removing the possibility of maintaining a <a href="https://segment.com/academy/advanced-analytics/an-introduction-to-multi-touch-attribution/">multi-touch attribution</a> pipeline to assign conversion credit to user impressions.</p>
<h3 id="evergreen-data">Evergreen Data</h3>
<p>MMMs, however, don’t rely on user-level data; instead they rely on first-party data that companies will always have: namely, what money went where. Marketing teams will always have budgets to track which advertising dollars were spent on which channels and campaigns, as well as conversions tracked to report revenue. These are the core data fed into an MMM, and as such, I don’t see them disappearing at any point in the future. Impressions are sometimes preferred over marketing spend as model inputs because they smooth out fluctuations in advertising CPA, but my guess is that advertisers will continue to provide at least a high-level estimate of impression magnitude to keep their customers happy.</p>
<h2 id="the-pitfalls">The Pitfalls</h2>
<p>With our options for measurement dwindling, why be pessimistic about MMM? It seems like the one stable methodology that will see widespread use. The main reasons revolve around the fact that it’s a small data problem with a big emphasis on slow but precise data collection, and the potential for silent methodology errors.</p>
<h3 id="lack-of-quality-data-input">Lack of Quality Data Input</h3>
<h4 id="not-enough-data">Not Enough Data</h4>
<p>Data collection can’t be rushed or automated the way labeled training data can; it accumulates naturally over time as your company markets its products or services. For some digital channels/advertisers, you may be able to collect both spend and impressions data down to a daily basis, whereas for traditional non-digital channels such as mail or print, you may be limited to a monthly cadence.</p>
<p>Given that your data inputs must be fed into the model at the least granular level, your company may take years to gain enough data to reliably estimate performance. Assume, for example, that you want to estimate the carryover and saturation effects at a channel level; depending on the functional forms you use to model them, this could mean 2-5 parameters for each channel. If you have 10+ channels, you can see how it will take a lengthy amount of time until you even have more data points than parameters, let alone enough to reduce uncertainty to a reasonable level.</p>
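<p>The arithmetic is sobering even in a toy setup (the parameter counts below are illustrative assumptions):</p>

```python
channels = 10
params_per_channel = 3   # e.g. effect size, carryover decay, saturation shape
control_params = 5       # trend, seasonality, holidays, etc. (illustrative)
total_params = channels * params_per_channel + control_params  # 35

# If the least granular channel reports monthly, each month adds one data
# point, so it takes total_params months (~3 years) just to reach n = p
print(total_params, round(total_params / 12, 1))
```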
<p>New channels equally struggle with this problem of small data and extracting seasonal effects. Often, one solution for this is to perform additional testing and apply the derived assumptions rather than inferring parameters.</p>
<h4 id="not-enough-data-variance">Not Enough Data Variance</h4>
<p>One rookie mistake new marketing divisions may make is to spend proportionally on channels month over month. Although the total spend in a time period may shift, if you spend the same proportions in each channel over time, you won’t have enough variance between channels and will end up with an entirely collinear dataset. Any model building after that will run into serious, unresolvable issues. This is a classic example of the explore/exploit tradeoff, where you need to add enough randomness to discern between channels while still maximizing profit. Fortunately, there may be natural barriers through ad planning and payment that prevent this issue entirely, but it’s something for a central planning team to monitor.</p>
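<p>A tiny example shows why fixed proportions are fatal: if every channel is a constant fraction of total spend, the channel series are perfectly correlated and no model can separate their effects (the spend numbers are made up):</p>

```python
import numpy as np

total = np.array([100.0, 120.0, 90.0, 150.0, 110.0])  # monthly total spend
tv, search = 0.6 * total, 0.4 * total  # fixed 60/40 split every month
corr = np.corrcoef(tv, search)[0, 1]   # ~1.0: perfectly collinear channels
print(round(corr, 6))
```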
<h4 id="missing-covariates">Missing Covariates</h4>
<p>Marketing spend only accounts for a portion of all conversions, with the remaining conversions attributed to a number of control factors. These could include natural demand, brand recognition, economic factors, total available market, competitor advertising/brand effects, etc. Without knowing beforehand whether they correlate with your target, it can be difficult to decide whether or not to invest in data collection or purchase. Sometimes known factors may be unavailable due to timing granularity or proprietary access. From my point of view, this is the primary reason using a third-party vendor to build an MMM is valuable: their expertise and data collection infrastructure may overcome some of these limitations and augment your data with crucial factors.</p>
<h3 id="misinterpretation-and-misspecification">Misinterpretation and Misspecification</h3>
<p>MMMs have been around for a long time, and the amount of domain expertise developed has led to a few common solutions being preferred. The technical implementation of those has become almost trivial, and, in my opinion, the design of your model is entirely based around the set of assumptions you apply. This is advantageous if you fully understand the problem space, but without appropriate experience or data access, a lot of these assumptions will have to be made in an ad hoc manner or on gut feel.</p>
<h4 id="domain-assumptions">Domain Assumptions</h4>
<p>Before delving into the modeling process, you have to specify a number of domain-specific assumptions, including whether channel advertising effects are additive or multiplicative and what functional forms the saturation curves and carryover lag effects should take. It would take extensive time and resources to actually test and validate all of these assumptions, and unless you’re a large marketing firm providing MMM as a service, it’s unlikely you will ever be able to validate your choices. In this case, you are accepting the risk of building a model that doesn’t correspond to reality and will always be wrong by an unmeasurable amount, but from what I’ve seen when reading guides and resources online, this risk isn’t highlighted nearly enough.</p>
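<p>For concreteness, here’s what two of the most common assumption choices, geometric carryover (adstock) and Hill-type saturation, might look like in numpy. The decay and half-saturation values are illustrative assumptions, and these are just two of many possible functional forms:</p>

```python
import numpy as np

def geometric_adstock(spend, decay=0.5):
    # Carryover: each period retains `decay` of the previous adstocked value
    out = np.empty(len(spend))
    carry = 0.0
    for t, x in enumerate(spend):
        carry = x + decay * carry
        out[t] = carry
    return out

def hill_saturation(x, half_sat=100.0, shape=1.0):
    # Diminishing returns: response approaches 1 as adstocked spend grows
    x = np.asarray(x, dtype=float)
    return x**shape / (half_sat**shape + x**shape)

spend = np.array([0.0, 100.0, 100.0, 0.0, 0.0])
adstocked = geometric_adstock(spend, decay=0.5)  # spend keeps echoing forward
response = hill_saturation(adstocked, half_sat=100.0)
```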
<h4 id="forecasting-assumptions">Forecasting Assumptions</h4>
<p>Well-specified MMMs take into account non-media control variables to isolate incremental conversions from marketing spend from all other effects. In my experience, planning based on MMM results takes into account only the incremental marketing ROI measurements and fails to consider all of the control variables used as well. However, just as in life, no decision is still a decision, and not taking any changes into account is close to extrapolating the current situation forwards. I’m not necessarily saying that trying to predict future economic conditions given the volatility of Covid, inflation, recession (?), and interest rates is advisable or even possible, but focusing entirely on media effects is just another way to have silent, unmeasurable errors.</p>
<h4 id="understanding-time-period-application">Understanding Time Period Application</h4>
<p>Most MMMs until now have tended to use all historical time periods in the modeling process; to update one, data since the last run is collected and appended to the full dataset. The outputs are analyzed and used to optimize marketing spend for an upcoming period, commonly a few months out. The fact that we learn only a single set of parameters for each channel over the whole time period is highly concerning to me, given that we know businesses shift their strategies, advertising effectiveness and scale (hopefully) over time. Yes, if the curve were static, we would be gaining more data points on the curve and decreasing the uncertainty in our estimate, but assuming that all channels’ parameters have remained constant over months or years seems unreasonable to me.</p>
<p>There seem to be two main methods of combatting this issue. The first is <a href="https://stats.stackexchange.com/questions/454415/how-to-account-for-the-recency-of-the-observations-in-a-regression-problem">recency weighting</a>, a method of giving larger importance to later observations when inferring channel parameters; while it can still leverage learnings from earlier time periods, it better represents recent trends to help align with planning. The other is <a href="https://arxiv.org/pdf/2106.03322.pdf">time-varying coefficients</a>, which capture the parameters at different points in time and seem to me powerful enough to become standard practice for MMM.</p>
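<p>Recency weighting can be as simple as an exponential discount on observation age; here’s a sketch (the 26-period half-life is an illustrative assumption):</p>

```python
import numpy as np

def recency_weights(n_obs, half_life=26):
    # Newest observation gets weight 1.0; one half_life older gets 0.5, etc.
    age = np.arange(n_obs)[::-1]  # rows ordered oldest -> newest, newest age 0
    return 0.5 ** (age / half_life)

weights = recency_weights(104, half_life=26)
# e.g. pass as sample_weight to a weighted regression fit
```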
<h2 id="conclusion-aka-my-bad-opinions">Conclusion (aka My Bad Opinions)</h2>
<p>MMM can still be enormously useful, but it requires thoughtful planning and review from domain experts and data scientists before digging into the actual modeling portion. If you can’t provide that level of support for data and assumptions validation, it may make more sense to pursue a third-party marketing firm and leverage their skills instead.</p>
<p>The good news is that big tech seems to be realizing the adpocalypse is continuing and are beginning to invest in MMM tools. Here are some that I think are valuable:</p>
<ul>
<li><a href="https://github.com/google/lightweight_mmm">Google’s Lightweight MMM</a>
<ul>
<li>This is a pretty feature complete python package for building Bayesian MMMs</li>
<li>It has convenient features such as geo-level modeling, preprocessing and budget optimization</li>
</ul>
</li>
<li><a href="https://facebookexperimental.github.io/Robyn/">Facebook’s Robyn</a>
<ul>
<li>Very powerful and feature rich MMM building tool in R.</li>
<li>In my opinion, this seems like the best option for developing your own MMM if you have the resources to dive deep into it and aren’t fully committed to only python</li>
</ul>
</li>
<li><a href="https://github.com/uber/orbit">Uber’s Orbit</a>
<ul>
<li>This is a python library that implements Bayesian time-varying coefficients for time series forecasting</li>
<li>If you want less functionality and are prioritizing time varying coefficients, this is the library for you</li>
<li>It’s not an MMM-specific library, so you are missing a lot of the preprocessing and diagnostic tools found in Robyn, but if you are willing to implement those yourself (they aren’t rocket science), then this is a great option</li>
</ul>
</li>
</ul>
<p>I mostly put this together to capture areas of discussion around MMM that I felt weren’t being addressed properly, and if you have any tips on addressing these problems, please feel free to share them here or elsewhere on the internet. I’m sure lots of folks are working on the same types of problems, and increasing the literature around the practical development of MMMs would help the industry a ton.</p>Matthew Burkematthew.wesley.burke@gmail.comMMMs are a future-proofed tool for measuring marketing effectiveness in a world of increased online privacy, but can be prone to multiple silent failure methods that provide inaccurate results.Spaghetti Code2022-02-23T00:00:00+00:002022-02-23T00:00:00+00:00https://mwburke.github.io/generative%20art/2022/02/23/making-spaghetti-code<p>As I mentioned in an <a href="/generative%20art/2019/06/18/basic-tiling.html">earlier post</a> about tiling patterns, I really like truchet tilings, where each tile traditionally has a pattern that connects the midpoints of adjacent sides. I explored this concept a long time ago, but instead of using a traditional square grid, I used a hexagonal one.</p>
<blockquote class="instagram-media" data-instgrm-permalink="https://www.instagram.com/p/CAHOcTvAwKK/?utm_source=ig_embed&utm_campaign=loading" data-instgrm-version="14" style=" background:#FFF; border:0; border-radius:3px; box-shadow:0 0 1px 0 rgba(0,0,0,0.5),0 1px 10px 0 rgba(0,0,0,0.15); margin: 1px; max-width:540px; min-width:326px; padding:0; width:99.375%; width:-webkit-calc(100% - 2px); width:calc(100% - 2px);"><div style="padding:16px;"> <a href="https://www.instagram.com/p/CAHOcTvAwKK/?utm_source=ig_embed&utm_campaign=loading" style=" background:#FFFFFF; line-height:0; padding:0 0; text-align:center; text-decoration:none; width:100%;" target="_blank"> <div style=" display: flex; flex-direction: row; align-items: center;"> <div style="background-color: #F4F4F4; border-radius: 50%; flex-grow: 0; height: 40px; margin-right: 14px; width: 40px;"></div> <div style="display: flex; flex-direction: column; flex-grow: 1; justify-content: center;"> <div style=" background-color: #F4F4F4; border-radius: 4px; flex-grow: 0; height: 14px; margin-bottom: 6px; width: 100px;"></div> <div style=" background-color: #F4F4F4; border-radius: 4px; flex-grow: 0; height: 14px; width: 60px;"></div></div></div><div style="padding: 19% 0;"></div> <div style="display:block; height:50px; margin:0 auto 12px; width:50px;"><svg width="50px" height="50px" viewBox="0 0 60 60" version="1.1" xmlns="https://www.w3.org/2000/svg" xmlns:xlink="https://www.w3.org/1999/xlink"><g stroke="none" stroke-width="1" fill="none" fill-rule="evenodd"><g transform="translate(-511.000000, -20.000000)" fill="#000000"><g><path d="M556.869,30.41 C554.814,30.41 553.148,32.076 553.148,34.131 C553.148,36.186 554.814,37.852 556.869,37.852 C558.924,37.852 560.59,36.186 560.59,34.131 C560.59,32.076 558.924,30.41 556.869,30.41 M541,60.657 C535.114,60.657 530.342,55.887 530.342,50 C530.342,44.114 535.114,39.342 541,39.342 C546.887,39.342 551.658,44.114 551.658,50 C551.658,55.887 546.887,60.657 541,60.657 M541,33.886 
C532.1,33.886 524.886,41.1 524.886,50 C524.886,58.899 532.1,66.113 541,66.113 C549.9,66.113 557.115,58.899 557.115,50 C557.115,41.1 549.9,33.886 541,33.886 M565.378,62.101 C565.244,65.022 564.756,66.606 564.346,67.663 C563.803,69.06 563.154,70.057 562.106,71.106 C561.058,72.155 560.06,72.803 558.662,73.347 C557.607,73.757 556.021,74.244 553.102,74.378 C549.944,74.521 548.997,74.552 541,74.552 C533.003,74.552 532.056,74.521 528.898,74.378 C525.979,74.244 524.393,73.757 523.338,73.347 C521.94,72.803 520.942,72.155 519.894,71.106 C518.846,70.057 518.197,69.06 517.654,67.663 C517.244,66.606 516.755,65.022 516.623,62.101 C516.479,58.943 516.448,57.996 516.448,50 C516.448,42.003 516.479,41.056 516.623,37.899 C516.755,34.978 517.244,33.391 517.654,32.338 C518.197,30.938 518.846,29.942 519.894,28.894 C520.942,27.846 521.94,27.196 523.338,26.654 C524.393,26.244 525.979,25.756 528.898,25.623 C532.057,25.479 533.004,25.448 541,25.448 C548.997,25.448 549.943,25.479 553.102,25.623 C556.021,25.756 557.607,26.244 558.662,26.654 C560.06,27.196 561.058,27.846 562.106,28.894 C563.154,29.942 563.803,30.938 564.346,32.338 C564.756,33.391 565.244,34.978 565.378,37.899 C565.522,41.056 565.552,42.003 565.552,50 C565.552,57.996 565.522,58.943 565.378,62.101 M570.82,37.631 C570.674,34.438 570.167,32.258 569.425,30.349 C568.659,28.377 567.633,26.702 565.965,25.035 C564.297,23.368 562.623,22.342 560.652,21.575 C558.743,20.834 556.562,20.326 553.369,20.18 C550.169,20.033 549.148,20 541,20 C532.853,20 531.831,20.033 528.631,20.18 C525.438,20.326 523.257,20.834 521.349,21.575 C519.376,22.342 517.703,23.368 516.035,25.035 C514.368,26.702 513.342,28.377 512.574,30.349 C511.834,32.258 511.326,34.438 511.181,37.631 C511.035,40.831 511,41.851 511,50 C511,58.147 511.035,59.17 511.181,62.369 C511.326,65.562 511.834,67.743 512.574,69.651 C513.342,71.625 514.368,73.296 516.035,74.965 C517.703,76.634 519.376,77.658 521.349,78.425 C523.257,79.167 525.438,79.673 528.631,79.82 C531.831,79.965 532.853,80.001 
541,80.001 C549.148,80.001 550.169,79.965 553.369,79.82 C556.562,79.673 558.743,79.167 560.652,78.425 C562.623,77.658 564.297,76.634 565.965,74.965 C567.633,73.296 568.659,71.625 569.425,69.651 C570.167,67.743 570.674,65.562 570.82,62.369 C570.966,59.17 571,58.147 571,50 C571,41.851 570.966,40.831 570.82,37.631"></path></g></g></g></svg></div><div style="padding-top: 8px;"> <div style=" color:#3897f0; font-family:Arial,sans-serif; font-size:14px; font-style:normal; font-weight:550; line-height:18px;">View this post on Instagram</div></div><div style="padding: 12.5% 0;"></div> <div style="display: flex; flex-direction: row; margin-bottom: 14px; align-items: center;"><div> <div style="background-color: #F4F4F4; border-radius: 50%; height: 12.5px; width: 12.5px; transform: translateX(0px) translateY(7px);"></div> <div style="background-color: #F4F4F4; height: 12.5px; transform: rotate(-45deg) translateX(3px) translateY(1px); width: 12.5px; flex-grow: 0; margin-right: 14px; margin-left: 2px;"></div> <div style="background-color: #F4F4F4; border-radius: 50%; height: 12.5px; width: 12.5px; transform: translateX(9px) translateY(-18px);"></div></div><div style="margin-left: 8px;"> <div style=" background-color: #F4F4F4; border-radius: 50%; flex-grow: 0; height: 20px; width: 20px;"></div> <div style=" width: 0; height: 0; border-top: 2px solid transparent; border-left: 6px solid #f4f4f4; border-bottom: 2px solid transparent; transform: translateX(16px) translateY(-4px) rotate(30deg)"></div></div><div style="margin-left: auto;"> <div style=" width: 0px; border-top: 8px solid #F4F4F4; border-right: 8px solid transparent; transform: translateY(16px);"></div> <div style=" background-color: #F4F4F4; flex-grow: 0; height: 12px; width: 16px; transform: translateY(-4px);"></div> <div style=" width: 0; height: 0; border-top: 8px solid #F4F4F4; border-left: 8px solid transparent; transform: translateY(-4px) translateX(8px);"></div></div></div> <div style="display: flex; 
flex-direction: column; flex-grow: 1; justify-content: center; margin-bottom: 24px;"> <div style=" background-color: #F4F4F4; border-radius: 4px; flex-grow: 0; height: 14px; margin-bottom: 6px; width: 224px;"></div> <div style=" background-color: #F4F4F4; border-radius: 4px; flex-grow: 0; height: 14px; width: 144px;"></div></div></a><p style=" color:#c9c8cd; font-family:Arial,sans-serif; font-size:14px; line-height:17px; margin-bottom:0; margin-top:8px; overflow:hidden; padding:8px 0 7px; text-align:center; text-overflow:ellipsis; white-space:nowrap;"><a href="https://www.instagram.com/p/CAHOcTvAwKK/?utm_source=ig_embed&utm_campaign=loading" style=" color:#c9c8cd; font-family:Arial,sans-serif; font-size:14px; font-style:normal; font-weight:normal; line-height:17px; text-decoration:none;" target="_blank">A post shared by Matthew Burke (@yot_club_)</a></p></div></blockquote>
<script async="" src="//www.instagram.com/embed.js"></script>
<p>Each hexagon has every side connected to another side, and by repeating these connections, you end up with some emergent structures of how the paths travel over time. I like calling them noodles, so that’s what I’ll do here from now on.</p>
<p>My original implementation was… a bit of a disaster from the code perspective. I left it for a while, and when I went back to revisit it to color each of these noodles individually, I found myself confused by my own work and gave up. Later on, I still felt the urge to improve upon my initial version, and recreating it from the ground up seemed like the fastest option.</p>
<h1 id="new-implementation">New Implementation</h1>
<h2 id="paths-generation">Paths Generation</h2>
<p>I built upon my initial idea of defining a grid of flat-top hexagons, and added some definitions to keep things clear. Each side of the hexagon is assigned an index integer, as well as a list of integer pairs to define which sides of the hexagon are connected. Each hexagon could contain zero to three sets of connections, with their order denoting the order in which they would be drawn to the screen.</p>
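<p>As a rough sketch of how those connection sets might be generated (the pairing scheme and function name here are illustrative assumptions, not the project’s actual code):</p>

```python
import random

def random_connections(rng, n_sets=3):
    """Pick up to n_sets disjoint pairs of the six side indices (0-5).

    Each pair is one path segment through the hexagon, and the list
    order is the order the segments are drawn to the screen."""
    sides = list(range(6))
    rng.shuffle(sides)  # a random partition of the sides into pairs
    return [tuple(sorted(sides[2 * i:2 * i + 2])) for i in range(n_sets)]
```

<p>Passing <code>n_sets</code> of 0 through 3 covers the “zero to three sets of connections” per hexagon mentioned above.</p>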
<p><img src="/images/spaghetti_3.png" alt="" /></p>
<p>Once we generate connections for each of the hexagons, we iterate over all the hexagons and over all the connections for each hexagon. We start at the first connection, and if it has not been visited, recursively follow the path from one hexagon to the next, using the side indices we defined earlier to choose the next hexagon, until we reach an end, a visited path, or the starting point. As we traverse them, we mark each of the connections as visited, so that when we reach that connection later on, we will not pursue it. One note is that, since we don’t know whether or not we are starting at an endpoint, we traverse the path in two different directions, and just swap the connection orders for one of the directions before combining.</p>
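<p>A minimal sketch of that traversal, assuming axial hexagon coordinates and the convention that crossing side <code>s</code> lands on side <code>(s + 3) % 6</code> of the neighbor; the direction table and the single-direction walk below are illustrative simplifications of the two-direction traversal described above:</p>

```python
# Axial-coordinate offsets to the neighbor across each side of a
# flat-top hexagon; the index-to-direction mapping is an assumption.
NEIGHBOR_OFFSETS = {0: (1, 0), 1: (1, -1), 2: (0, -1),
                    3: (-1, 0), 4: (-1, 1), 5: (0, 1)}

def follow_path(grid, start_hex, start_side):
    """Walk one direction along a path of (hexagon, connection) steps
    until it leaves the grid or hits an already-visited connection.

    grid maps axial (q, r) coordinates to lists of side-index pairs."""
    path, visited = [], set()
    hex_, side = start_hex, start_side
    while True:
        conns = grid.get(hex_, [])
        conn = next((c for c in conns if side in c), None)
        if conn is None or (hex_, conn) in visited:
            return path
        visited.add((hex_, conn))
        path.append((hex_, conn))
        # Leave through the connection's other side...
        exit_side = conn[1] if conn[0] == side else conn[0]
        # ...and enter the neighboring hexagon through the side that
        # touches the one we just exited.
        dq, dr = NEIGHBOR_OFFSETS[exit_side]
        hex_ = (hex_[0] + dq, hex_[1] + dr)
        side = (exit_side + 3) % 6
```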
<p>The result of following a single path across hexagons is shown below:</p>
<p><img src="/images/spaghetti_0.png" alt="" /></p>
<p>We do this for all hexagons and all connections, collecting them into path objects with their hexagon locations and x/y coordinates. Once this is finished, we have all our paths and just need to cap them with semicircles of the correct end colors. Here is a final render of all paths being drawn without caps, overlaid with the hexagon grid that underlies them all.</p>
<p><img src="/images/spaghetti_2.png" alt="" /></p>
<p>At this point, we have all we need and just need to assign the colors and work on generating interesting patterns.</p>
<h2 id="style-generation">Style Generation</h2>
<p>While there were dozens of ideas I could have explored, I get overwhelmed by too many options, so I limited myself to a few areas to delve deeper into and came up with the following final list of constrained parameters to work within.</p>
<ul>
<li>Colors
<ul>
<li>Overall palette
<ul>
<li>What do we want to use for the background and noodle colors?</li>
<li>Even if a palette is good, does it have enough contrast to distinguish between foreground/background?</li>
<li>Does the palette have muddy colors if we have a gradient or do the in between colors look good as well?</li>
</ul>
</li>
<li>Noodle style
<ul>
<li>Color it directly or add an outline surrounding it
<ul>
<li>If we have an outline, do we keep a static one across all noodles or pick a different outline color for each?</li>
</ul>
</li>
<li>Color the noodle with a single color or make a color-changing gradient
<ul>
<li>If we have a gradient, do we rotate through the palette in order or assign sequential colors randomly?</li>
</ul>
</li>
</ul>
</li>
</ul>
</li>
<li>Path shapes
<ul>
<li>Connection patterns
<ul>
<li>Fully random for each hexagon
<ul>
<li>Do we keep three sets of connections for each one or do we have a lower amount?</li>
</ul>
</li>
<li>Stripes along horizontal, vertical and diagonals</li>
<li>Chunks of MxN hexagon groups that are repeated along horizontal, vertical and diagonals with or without offsets</li>
</ul>
</li>
<li>Border
<ul>
<li>Do we feed connections back to the middle or just truncate?</li>
</ul>
</li>
<li>Curve polygon fidelity
<ul>
<li>Closer to fully round</li>
<li>Low poly</li>
</ul>
</li>
</ul>
</li>
</ul>
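<p>The two gradient modes above can be sketched as a small helper (illustrative only; the actual piece presumably does this in its rendering code rather than as a standalone Python function):</p>

```python
def noodle_colors(n_segments, palette, rng=None):
    """Assign a color per noodle segment: cycle the palette in order,
    or pick each successive color at random when an rng is given (the
    two gradient options discussed above)."""
    if rng is None:
        return [palette[i % len(palette)] for i in range(n_segments)]
    return [rng.choice(palette) for _ in range(n_segments)]
```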
<p>Ultimately, I think my favorites were gradient noodles without outlines that either had random connections or repeating chunk patterns.</p>
<blockquote class="instagram-media" data-instgrm-permalink="https://www.instagram.com/p/CZqAMbSvnzQ/?utm_source=ig_embed&utm_campaign=loading" data-instgrm-version="14" style=" background:#FFF; border:0; border-radius:3px; box-shadow:0 0 1px 0 rgba(0,0,0,0.5),0 1px 10px 0 rgba(0,0,0,0.15); margin: 1px; max-width:540px; min-width:326px; padding:0; width:99.375%; width:-webkit-calc(100% - 2px); width:calc(100% - 2px);"><div style="padding:16px;"> <a href="https://www.instagram.com/p/CZqAMbSvnzQ/?utm_source=ig_embed&utm_campaign=loading" style=" background:#FFFFFF; line-height:0; padding:0 0; text-align:center; text-decoration:none; width:100%;" target="_blank"> <div style=" display: flex; flex-direction: row; align-items: center;"> <div style="background-color: #F4F4F4; border-radius: 50%; flex-grow: 0; height: 40px; margin-right: 14px; width: 40px;"></div> <div style="display: flex; flex-direction: column; flex-grow: 1; justify-content: center;"> <div style=" background-color: #F4F4F4; border-radius: 4px; flex-grow: 0; height: 14px; margin-bottom: 6px; width: 100px;"></div> <div style=" background-color: #F4F4F4; border-radius: 4px; flex-grow: 0; height: 14px; width: 60px;"></div></div></div><div style="padding: 19% 0;"></div> <div style="display:block; height:50px; margin:0 auto 12px; width:50px;"><svg width="50px" height="50px" viewBox="0 0 60 60" version="1.1" xmlns="https://www.w3.org/2000/svg" xmlns:xlink="https://www.w3.org/1999/xlink"><g stroke="none" stroke-width="1" fill="none" fill-rule="evenodd"><g transform="translate(-511.000000, -20.000000)" fill="#000000"><g><path d="M556.869,30.41 C554.814,30.41 553.148,32.076 553.148,34.131 C553.148,36.186 554.814,37.852 556.869,37.852 C558.924,37.852 560.59,36.186 560.59,34.131 C560.59,32.076 558.924,30.41 556.869,30.41 M541,60.657 C535.114,60.657 530.342,55.887 530.342,50 C530.342,44.114 535.114,39.342 541,39.342 C546.887,39.342 551.658,44.114 551.658,50 C551.658,55.887 546.887,60.657 541,60.657 M541,33.886 
C532.1,33.886 524.886,41.1 524.886,50 C524.886,58.899 532.1,66.113 541,66.113 C549.9,66.113 557.115,58.899 557.115,50 C557.115,41.1 549.9,33.886 541,33.886 M565.378,62.101 C565.244,65.022 564.756,66.606 564.346,67.663 C563.803,69.06 563.154,70.057 562.106,71.106 C561.058,72.155 560.06,72.803 558.662,73.347 C557.607,73.757 556.021,74.244 553.102,74.378 C549.944,74.521 548.997,74.552 541,74.552 C533.003,74.552 532.056,74.521 528.898,74.378 C525.979,74.244 524.393,73.757 523.338,73.347 C521.94,72.803 520.942,72.155 519.894,71.106 C518.846,70.057 518.197,69.06 517.654,67.663 C517.244,66.606 516.755,65.022 516.623,62.101 C516.479,58.943 516.448,57.996 516.448,50 C516.448,42.003 516.479,41.056 516.623,37.899 C516.755,34.978 517.244,33.391 517.654,32.338 C518.197,30.938 518.846,29.942 519.894,28.894 C520.942,27.846 521.94,27.196 523.338,26.654 C524.393,26.244 525.979,25.756 528.898,25.623 C532.057,25.479 533.004,25.448 541,25.448 C548.997,25.448 549.943,25.479 553.102,25.623 C556.021,25.756 557.607,26.244 558.662,26.654 C560.06,27.196 561.058,27.846 562.106,28.894 C563.154,29.942 563.803,30.938 564.346,32.338 C564.756,33.391 565.244,34.978 565.378,37.899 C565.522,41.056 565.552,42.003 565.552,50 C565.552,57.996 565.522,58.943 565.378,62.101 M570.82,37.631 C570.674,34.438 570.167,32.258 569.425,30.349 C568.659,28.377 567.633,26.702 565.965,25.035 C564.297,23.368 562.623,22.342 560.652,21.575 C558.743,20.834 556.562,20.326 553.369,20.18 C550.169,20.033 549.148,20 541,20 C532.853,20 531.831,20.033 528.631,20.18 C525.438,20.326 523.257,20.834 521.349,21.575 C519.376,22.342 517.703,23.368 516.035,25.035 C514.368,26.702 513.342,28.377 512.574,30.349 C511.834,32.258 511.326,34.438 511.181,37.631 C511.035,40.831 511,41.851 511,50 C511,58.147 511.035,59.17 511.181,62.369 C511.326,65.562 511.834,67.743 512.574,69.651 C513.342,71.625 514.368,73.296 516.035,74.965 C517.703,76.634 519.376,77.658 521.349,78.425 C523.257,79.167 525.438,79.673 528.631,79.82 C531.831,79.965 532.853,80.001 
541,80.001 C549.148,80.001 550.169,79.965 553.369,79.82 C556.562,79.673 558.743,79.167 560.652,78.425 C562.623,77.658 564.297,76.634 565.965,74.965 C567.633,73.296 568.659,71.625 569.425,69.651 C570.167,67.743 570.674,65.562 570.82,62.369 C570.966,59.17 571,58.147 571,50 C571,41.851 570.966,40.831 570.82,37.631"></path></g></g></g></svg></div><div style="padding-top: 8px;"> <div style=" color:#3897f0; font-family:Arial,sans-serif; font-size:14px; font-style:normal; font-weight:550; line-height:18px;">View this post on Instagram</div></div><div style="padding: 12.5% 0;"></div> <div style="display: flex; flex-direction: row; margin-bottom: 14px; align-items: center;"><div> <div style="background-color: #F4F4F4; border-radius: 50%; height: 12.5px; width: 12.5px; transform: translateX(0px) translateY(7px);"></div> <div style="background-color: #F4F4F4; height: 12.5px; transform: rotate(-45deg) translateX(3px) translateY(1px); width: 12.5px; flex-grow: 0; margin-right: 14px; margin-left: 2px;"></div> <div style="background-color: #F4F4F4; border-radius: 50%; height: 12.5px; width: 12.5px; transform: translateX(9px) translateY(-18px);"></div></div><div style="margin-left: 8px;"> <div style=" background-color: #F4F4F4; border-radius: 50%; flex-grow: 0; height: 20px; width: 20px;"></div> <div style=" width: 0; height: 0; border-top: 2px solid transparent; border-left: 6px solid #f4f4f4; border-bottom: 2px solid transparent; transform: translateX(16px) translateY(-4px) rotate(30deg)"></div></div><div style="margin-left: auto;"> <div style=" width: 0px; border-top: 8px solid #F4F4F4; border-right: 8px solid transparent; transform: translateY(16px);"></div> <div style=" background-color: #F4F4F4; flex-grow: 0; height: 12px; width: 16px; transform: translateY(-4px);"></div> <div style=" width: 0; height: 0; border-top: 8px solid #F4F4F4; border-left: 8px solid transparent; transform: translateY(-4px) translateX(8px);"></div></div></div> <div style="display: flex; 
flex-direction: column; flex-grow: 1; justify-content: center; margin-bottom: 24px;"> <div style=" background-color: #F4F4F4; border-radius: 4px; flex-grow: 0; height: 14px; margin-bottom: 6px; width: 224px;"></div> <div style=" background-color: #F4F4F4; border-radius: 4px; flex-grow: 0; height: 14px; width: 144px;"></div></div></a><p style=" color:#c9c8cd; font-family:Arial,sans-serif; font-size:14px; line-height:17px; margin-bottom:0; margin-top:8px; overflow:hidden; padding:8px 0 7px; text-align:center; text-overflow:ellipsis; white-space:nowrap;"><a href="https://www.instagram.com/p/CZqAMbSvnzQ/?utm_source=ig_embed&utm_campaign=loading" style=" color:#c9c8cd; font-family:Arial,sans-serif; font-size:14px; font-style:normal; font-weight:normal; line-height:17px; text-decoration:none;" target="_blank">A post shared by Matthew Burke (@yot_club_)</a></p></div></blockquote>
<script async="" src="//www.instagram.com/embed.js"></script>
<h2 id="resources">Resources</h2>
<ul>
<li><a href="https://observablehq.com/@osteele/truchet-tile-generation">Generalized Truchet Tiles</a></li>
<li><a href="https://www.redblobgames.com/grids/hexagons/">Hexagonal Grids - Red Blob Games</a></li>
</ul>Matthew Burkematthew.wesley.burke@gmail.comMaking generative noodle artCodenames Clue Generator using Semantic Similarity2021-12-12T00:00:00+00:002021-12-12T00:00:00+00:00https://mwburke.github.io/data%20science/2021/12/12/codenames-clue-generator-version-1<p>In this post, I’ll talk about how I built a clue generator for the game Codenames that provides a list of potential clues, numbers and associated target words, all with Tensorflow.</p>
<p>My day job is mostly internally facing and so I took this on as a way to practice building product-focused data science projects.</p>
<h1 id="how-codenames-works">How Codenames Works</h1>
<p>If you already know how the game works, feel free to skip or read again for a quick reminder.</p>
<p>Codenames is a card game with 2 teams. There are 25 cards laid out on the board, 9 belonging to one team, 8 belonging to another, 7 neutral and 1 double agent card.</p>
<p>Each team has a <strong>codemaster</strong> who can see which cards belong to which teams, and the remaining members of the teams are <strong>spies</strong> who only see a single word on each card.</p>
<p>The teams take turns having the codemaster provide a clue to their team made up of a single word and a number, where the number indicates how many cards on the board the clue relates to. The goal is to get the team to guess which words the clue is indicating, and they select cards to turn over.</p>
<p>If they select a card belonging to their team, they can continue guessing, but if they flip over a card that doesn’t, their turn is immediately ended and they could suffer the negative consequences of potentially flipping over the other team’s card, bringing them closer to their goal, or flipping over the double agent card and instantly losing the game.</p>
<p>Thus, the codemaster seeks to find clues that maximize the relationship to words on their team and minimize the relationship to words on the other team. Additionally, by finding clues with a larger number of cards it relates to, they can increase their chance of beating the other team by finishing first, but they risk having a lower relevance to each of the target cards and higher chance of accidentally missing a connection for opposing cards.</p>
<h1 id="how-i-built-it">How I Built It</h1>
<h2 id="project-goals">Project Goals</h2>
<p>We represent a current board and team state with the following inputs:</p>
<ul>
<li>Current team’s cards</li>
<li>Opposing team’s cards</li>
<li>Neutral cards</li>
<li>Double agent card</li>
</ul>
<p>What we are looking for is a list of potential clues the codemaster could use with the following fields:</p>
<ol>
<li>Clue word</li>
<li>Clue number</li>
<li>Target words the clue is intended to relate to</li>
<li>Quantitative measure of the quality of the clue</li>
</ol>
<h2 id="quantifying-clue-quality">Quantifying Clue Quality</h2>
<p>As with most data science problems, the hardest part is quantifying exactly what you are looking to maximize or predict. In this case, we have a vague notion of maximizing and minimizing the relevance of our clue word to words on the board. While there are many ways to do this, the way I chose to frame it for now is in terms of embeddings.</p>
<h3 id="word-embeddings">Word Embeddings</h3>
<p><a href="https://en.wikipedia.org/wiki/Word_embedding">Word embeddings</a> are a way to represent words quantitatively with a list of numbers, which we will refer to here as a vector. The main idea is that words with similar meanings will have similar number representations, and that related words will have a similar relationship. For example, woman -> man should have a similar relationship as queen -> king. Or Pooh -> Tigger should have a similar relationship as bear -> tiger (ok maybe this one’s a bit of a stretch, but you get the picture).</p>
<p>Rather than generating my own, I used a pre-trained model from Tensorflow, the <a href="https://tfhub.dev/google/Wiki-words-500/2">Wiki-words-500</a> text embedding that already provides a mapping from words to their vector representations. I now have a function to translate any given English word into a vector of length 500.</p>
<p>Please see the end for discussions about future improvements related to choosing an embedding corpus.</p>
<h3 id="word-relevance---cosine-similarity">Word Relevance - Cosine Similarity</h3>
<p>Having numerical representations of words is a start, but what we really care about is the relationships between words. We need to compare the vectors to begin to use them.</p>
<p>When comparing vectors, you will often hear the language of <strong>distance</strong> and <strong>similarity</strong>, which are two sides of the same coin, meaning difference and closeness of two vectors, respectively. For certain types of distances, we may just subtract the value from one to switch between the two.</p>
<p>For this case, I chose to work with <a href="https://en.wikipedia.org/wiki/Cosine_similarity">cosine similarity</a>, although I may look into other options in the future. This gives us a single number ranging from -1 to 1, with -1 indicating two words’ being as dissimilar as possible and 1 being equivalent.</p>
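<p>For two embedding vectors, cosine similarity is just the dot product divided by the vector lengths; a minimal NumPy version (an illustrative helper, not the post’s actual Tensorflow code):</p>

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1 when they point the
    same way, 0 when orthogonal, -1 when opposite."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
```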
<h3 id="clue-quality-metric">Clue Quality Metric</h3>
<p>In order to summarize clue quality in a single number, we consider the benefits and penalties associated with the outcome of guessing a card on the table. Obviously, we want to incentivize choosing clues that are relevant to our team and disincentivize other cards, with increasing penalties for the undesirable outcomes: a neutral card ends our turn, the opposing team’s card ends our turn and advances them toward their goal, and the double agent loses us the game.</p>
<p>The way we summarize this is by multiplying the cosine similarity for each card on the table by a set of coefficients that represent these benefits/penalties. The process is as follows:</p>
<ol>
<li>Extract word bank embeddings and cache since they will be reused for all games</li>
<li>Get current game word embeddings</li>
<li>Calculate cosine similarity between all game words and all word bank words</li>
<li>Multiply similarity scores by appropriate card type coefficients</li>
<li>Sum up all final scores for each word bank word to get clue quality metric</li>
</ol>
<p>This can all be accomplished very quickly with Tensorflow using their pre-trained embeddings and a series of matrix multiplications.</p>
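<p>A NumPy sketch of steps 3–5 above (shapes, names, and coefficient values here are illustrative assumptions; the post’s implementation uses Tensorflow):</p>

```python
import numpy as np

def clue_quality(bank_emb, game_emb, coefs):
    """bank_emb: (n_bank, d) word-bank embeddings.
    game_emb: (n_game, d) embeddings of the words on the board.
    coefs: (n_game,) per-card coefficients, positive for our team's
    cards and negative for the rest.
    Returns one quality score per word-bank word."""
    # Row-normalize so a single matrix product yields all pairwise
    # cosine similarities at once.
    bank = bank_emb / np.linalg.norm(bank_emb, axis=1, keepdims=True)
    game = game_emb / np.linalg.norm(game_emb, axis=1, keepdims=True)
    sims = bank @ game.T                    # (n_bank, n_game)
    # Weight each similarity by its card-type coefficient and sum.
    return (sims * coefs).sum(axis=1)
```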
<h4 id="clue-size">Clue Size</h4>
<p>We do have an additional constraint to limit the number of words that the clue relates to, which changes how we think about the quality metric. The overall structure remains the same, but we need some way to determine which of our team’s cards to include in the clue.</p>
<p>The way I implemented it was to set a similarity threshold and only keep clues that have a similarity value equal to or greater than the threshold. This is the most straightforward way, and it ensures a global level of relevance. Of course, this introduces another parameter to tweak that we don’t have an exact way to measure the effectiveness of, and we do run the risk of excluding relevant clues that fall right below the cutoff. However, as problems go, having your team select another one of their cards is a decent one to have, although it may cause confusion later down the line.</p>
<p>The process for calculating the quality metric remains the same as above, except that we first remove all cards below the similarity threshold and then calculate the contribution of the remaining ones toward our metric.</p>
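<p>A small sketch of that filtering step for picking a clue’s target words and number (the 0.3 default threshold is an arbitrary illustrative value, not the post’s tuned setting):</p>

```python
import numpy as np

def clue_targets(team_sims, team_words, threshold=0.3):
    """Keep only the team's cards whose similarity to the clue meets
    the threshold; the clue number is simply how many survive."""
    keep = np.asarray(team_sims) >= threshold
    targets = [w for w, k in zip(team_words, keep) if k]
    return targets, len(targets)
```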
<h4 id="setting-coefficients">Setting Coefficients</h4>
<p>It’s clear that we want a positive coefficient for our cards and monotonically decreasing negative coefficients for opposing, neutral and double agent cards respectively, but it’s not obvious exactly what they should be for several reasons:</p>
<ol>
<li>All of the coefficients are relative to one another so there isn’t a single global optimum</li>
<li>We are codifying the codemaster’s risk preferences to a single set of numbers
<ul>
<li>Some people may have a higher risk tolerance for clues similar to the double agent card, or they may never want to even have a small chance of guessing it</li>
</ul>
</li>
<li>The number of cards in each category changes over the course of the game
<ul>
<li>We may need to scale the contributions of remaining team/opposing cards. If both teams are guessing accurately, there will be few cards belonging to them and a higher concentration of neutral cards.</li>
<li>This may dilute the quality metric by having it be mostly composed of negative scores. The clues will mostly be avoiding the other cards rather than leaning towards the remaining cards</li>
<li>It remains to be seen if this is problematic, or if at that point the codemaster no longer needs to rely on a clue generator, since the problem space is much smaller</li>
</ul>
</li>
<li>We don’t have a clear metric on how to evaluate the effectiveness of the metric as of now</li>
</ol>
<h2 id="solution-validation--testing-plans">Solution Validation / Testing Plans</h2>
<p>Number 4 above is the elephant in the room: <strong>How do we know our solution is effective?</strong></p>
<p>The ideal method would be to test a bunch of games with randomly assigned teams, and provide the test teams with access to the clue recommendations. Our expectation is that the win rates would be equal between groups, and any significant difference would be driven by access to the tool. I would rather test giving tool access, but not mandating usage, because that’s a more realistic scenario in practice than forcing them to use the top recommendations every time. At this point, I don’t think we would consistently beat player intuition, so it’s not a valid comparison. However, the time required to get volunteers and acquire data seems impractical, so are there any other ways we can perform testing?</p>
<h3 id="backtesting">Backtesting</h3>
<p>If you run a Codenames online site with textual clue inputs, you could backtest and see how many times the clues given by users would have been recommended by the tool. There are multiple metrics used in recommender systems you could use to evaluate performance, including <a href="https://en.wikipedia.org/wiki/Discounted_cumulative_gain">NDCG</a> or an adapted version of <a href="http://sdsawtelle.github.io/blog/output/mean-average-precision-MAP-for-recommender-systems.html">Mean Average Precision</a>.</p>
<p>Regardless of what method you use, there are several problems:</p>
<ol>
<li>Sometimes people give bad clues. I’ve done it, others do it. It’s fairly common. How will this affect our scores?
<ul>
<li>We could potentially do some censoring to only include clues where the codemaster’s team guessed all of the associated words correctly, if we had access to that data. We could determine whether or not they guessed the correct number of cards, but as far as I’ve seen, online sites don’t seem to have tagging for relevant words to clues. At the very least, it would be a more fair comparison, even if there’s still a known source of error.</li>
</ul>
</li>
<li>The recommender word bank may include many words not in the common vernacular that are still relevant. Should they be penalized just because they’re niche?</li>
<li>We don’t have any proper nouns in our word bank. These can be very effective: think Potter for ceramic and magic as an example.</li>
</ol>
<h3 id="mechanical-turk">Mechanical Turk</h3>
<p>A common way to generate datasets for bespoke targets is through <a href="https://www.mturk.com/">Amazon Mechanical Turk</a> , where you can get people to complete arbitrary tasks online for money. This often is used in ML to generate labels for unsupervised data such as images or natural language.</p>
<p>In this case, proper evaluation takes a fair amount of background understanding of the game just to be able to make evaluations, and for accurate evaluations, experience actually playing. Given that the cost of getting random people to take the time to learn a new game, confirm that their understanding is accurate, and then actually play test games would be exorbitant, we need to break our method into easier-to-consume subtasks that are proxies for clue quality.</p>
<p>I propose that we could potentially focus on getting people to evaluate clue similarity or dissimilarity to a set of words. This could be done either as choosing the most/least relevant clue to a set of words from a list of potential clues, or providing a clue and bank of words, and having them choose the most/least relevant words to the clue. This removes the need to evaluate multiple objectives simultaneously, and increases the amount of data we could collect per dollar. Evaluation would be between existing versions of the clue generator, or between existing game samples and the clue generator.</p>
<p>Again, this suffers from not actually evaluating performance on the game metrics, but, once we have an existing solution we deem is working well, we could use it as a way to test champion/challenge models on specific parts of the quality score (similarity to team words, dissimilarity to all other words).</p>
<h1 id="future-work">Future Work</h1>
<p>If not obvious by now, there are a lot of potential areas for improvement that I would like to pursue given time, but here are some of the main ones:</p>
<h2 id="graph-based-similarity">Graph-Based Similarity</h2>
<p>The current approach suffers from words with multiple meanings, the curse of dimensionality, a lack of concrete, objective measurements of similarity, and proper nouns in the word bank.</p>
<p>Switching to a knowledge graph, or even web-search <a href="https://en.wikipedia.org/wiki/PageRank">PageRank</a> like approach would help shore up the above problems and maybe be used in tandem with semantic similarity recommendations if not replacing it entirely.</p>
<h2 id="word-embeddings-1">Word Embeddings</h2>
<ul>
<li>Additional research into more appropriate pre-trained word embeddings</li>
<li>Generate our own embeddings by training an NLP model on a corpus we designed for this</li>
</ul>
<h2 id="quality-metric">Quality Metric</h2>
<ul>
<li>Add a relative score component for clue selection
<ul>
<li>Using an elbow method similar to identifying the appropriate number of clusters?</li>
</ul>
</li>
<li>Scaling based on number of cards still available to deal with clue dilution of team’s cards compared to other cards</li>
</ul>
<h1 id="resources">Resources</h1>
<ul>
<li><a href="https://codenames.game/">Codenames Online Game</a></li>
<li>Word Embeddings:
<ul>
<li><a href="https://machinelearningmastery.com/what-are-word-embeddings/">Machine Learning Mastery: What Are Word Embeddings</a></li>
<li>Tensorflow has a <a href="https://www.tensorflow.org/text/guide/word_embeddings">guide to working with embeddings</a> in a neural network for those who work in ML/NLP.</li>
</ul>
</li>
<li><a href="https://en.wikipedia.org/wiki/Cosine_similarity">Cosine Similarity</a></li>
</ul>Matthew Burkematthew.wesley.burke@gmail.comUtilizing Tensorflow pre-trained embeddings to recommend potential clues to the codemasters in the card game CodenamesBasic Geometric Tiling2019-06-18T00:00:00+00:002019-06-18T00:00:00+00:00https://mwburke.github.io/generative%20art/2019/06/18/basic-tiling<h1 id="geometric-tiling">Geometric Tiling</h1>
<p>I went on vacation to Italy recently, and while I was there, I fell in love with the mosaic tilings in the Cathedral of Santa Maria del Fiore and Baptistery of St. John in Florence. In general, I’m a huge fan of geometric design, but these designs really caught my eye, and I did my best to recreate some of them in Processing with some nonstandard color palettes:</p>
<p><img src="/images/italy_mosaic_1.png" alt="" /></p>
<p><img src="/images/italy_mosaic_2.png" alt="" /></p>
<p>If these piqued your interest, I’d recommend checking out more <a href="https://mwburke.github.io/generative-art/posts/030.html">at my generative art site</a>, or much better, go visit Florence yourself and get inspired!</p>
<p>Of course, once I returned home and was talking about how beautiful the tiling was, I was informed about <a href="https://en.wikipedia.org/wiki/Islamic_geometric_patterns">Islamic geometric patterns</a>, which blew Italy out of the water in terms of complexity and creativity. I definitely will be reviewing my future travel plans in light of this discovery, and in the meantime, I hopefully can learn more about their theory and history to get a better appreciation of them.</p>
<h2 id="organic-tiling-truchet-patterns">Organic Tiling: Truchet Patterns</h2>
<p>While geometric patterns are always awesome, I had recently run into <a href="https://christophercarlson.com/portfolio/multi-scale-truchet-patterns/">this article</a> talking about truchet patterns, and wanted to try something a little more rounded and actually generative to see if it had a more “organic” feel about it.</p>
<p>The idea behind them is that they are square tiles with round internal paths/connections that can be connected to any other tile in the set. It didn’t take long to create each one of them, but after just generating a random tileset, the results are rather unsatisfying:</p>
<p><img src="/images/truchet_pattern_4.png" alt="" /></p>
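<p>The random-tileset step above can be sketched as pure data generation. My sketches are written in processing, but the underlying data looks roughly like this in Python; the tile names here are hypothetical labels I made up for the two classic quarter-circle tiles:</p>

```python
import random

# hypothetical names for the two classic quarter-circle truchet tiles
TILES = ("arc_nw_se", "arc_ne_sw")
ROTATIONS = (0, 90, 180, 270)

def random_truchet_grid(rows, cols, seed=None):
    """Assign every cell a tile and a rotation, uniformly at random."""
    rng = random.Random(seed)
    return [
        [(rng.choice(TILES), rng.choice(ROTATIONS)) for _ in range(cols)]
        for _ in range(rows)
    ]

grid = random_truchet_grid(8, 8, seed=1)
```

A rendering pass then just draws each tile's arcs at the stored rotation; all the structure (or lack of it) lives in how the grid is filled.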
<p>This is pretty much in line with what I have been viewing and reading from well-known generative artists, so I took a stab at introducing a little more structure into the process by nesting squares of patterns within each other, and was quite pleased:</p>
<p><img src="/images/truchet_pattern_1.png" alt="" /></p>
<p><img src="/images/truchet_pattern_3.png" alt="" /></p>
<p>I took it a step further, and while still limiting the available tiles and placing them diagonally, I allowed the rotation to vary. These might be some of my favorite results: they’re not so random as to be without structure, but they feel more natural:</p>
<p><img src="/images/truchet_pattern_2.png" alt="" /></p>
<p>There’s definitely more work I could do with utilizing the smaller subtiling and a larger number of tiles, but I’ll leave that for future work. If you’re interested in seeing more of these patterns, you can <a href="https://mwburke.github.io/generative-art/posts/032.html">do so here</a> and create as many as you want!</p>
<p>I hope to do some more work on hexagonal tiling with connections based on node-based growth algorithms, which I think have a lot of potential for walking the line between structure and chaos.</p>
<h3 id="resources">Resources:</h3>
<ul>
<li><a href="https://christophercarlson.com/portfolio/multi-scale-truchet-patterns/">Multi-Scale Truchet Patterns - Christopher Carlson</a></li>
</ul>Matthew Burkematthew.wesley.burke@gmail.comInspiration from Italian MosaicsMulti-Armed Bandits Exploration2019-06-18T00:00:00+00:002019-06-18T00:00:00+00:00https://mwburke.github.io/data%20science/2019/06/18/bandits-exploration<h2 id="multi-armed-bandit-overview">Multi-Armed Bandit Overview</h2>
<p>A multi-armed what?? If you don’t know what the multi-armed bandit problem is, then you may be confused. I’m assuming that you have some background on this for the rest of the post, but if you don’t, here’s a quick rundown:</p>
<p>Pretend you’re someone looking to go gambling, and the machine you can play is an old-style slot machine (aka a bandit; don’t worry about why) with multiple arms to choose from. Your goal is (obviously) to make the most money from putting coins into it and pulling the arms. However, given that you can only pull one arm at a time, how do you find the arm(s) that give you the most bang for your buck without wasting time on arms that just eat your money?</p>
<p>That’s essentially what the multi-armed bandit problem is. How do we maximize rewards by <em>exploring</em> new arms we don’t know much about (have only played zero or a few times), while still <em>exploiting</em> (or taking advantage of) the arms we already know give us good rewards?</p>
<p>Alright, now that we’ve covered that, we can jump into some code and ways I explored common algorithms used to maximize profits in this scenario.</p>
<h2 id="bandit-definitions">Bandit Definitions</h2>
<p>But first, let’s look again at how the bandits themselves are defined. I played around with two types:</p>
<ol>
<li><strong>Bernoulli Bandit</strong>: each arm in the bandit has a set probability each time it’s pulled of returning a reward of 1 or 0</li>
<li><strong>Gaussian Bandit</strong>: each arm in the bandit has a mean and standard deviation that define a gaussian distribution. When pulled, it samples from that distribution to return a reward.</li>
</ol>
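<p>As a rough sketch (the notebook’s actual implementation may differ, and the arm parameters here are arbitrary), the two bandit types fit in a few lines of Python:</p>

```python
import random

class BernoulliBandit:
    """Each arm pays 1 with a fixed probability, otherwise 0."""
    def __init__(self, probs):
        self.probs = probs  # one success probability per arm

    def pull(self, arm):
        return 1 if random.random() < self.probs[arm] else 0

class GaussianBandit:
    """Each arm samples its reward from a fixed normal distribution."""
    def __init__(self, means, stds):
        self.means = means
        self.stds = stds

    def pull(self, arm):
        return random.gauss(self.means[arm], self.stds[arm])

bandit = BernoulliBandit([0.2, 0.5, 0.8])
reward = bandit.pull(2)  # 1 with 80% probability, else 0
```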
<p>Here’s a quick visualization of the arm means, with bands one standard deviation away from those means, to give an idea of the potential overlap in rewards you may get from the gaussian bandits. The x axis is the arm number, and the y axis is the reward distribution.</p>
<p><img src="/images/gaussian_rewards.png" alt="" /></p>
<p>Clearly arms 3-4 are the highest ones, but their rewards still overlap greatly with arm 2’s, and it would be tricky to tell which one is best, given the amount of noise when sampling.</p>
<h2 id="execution">Execution</h2>
<p>The methods to choose arms in a programmatic way could be called methods or algorithms or whatever, but since I’ve been exploring reinforcement learning recently, I’m going to call them agents.</p>
<p>At each timestep a few things happen:</p>
<ol>
<li>The agent evaluates its current stored information and chooses an arm to interact with</li>
<li>The agent pulls the chosen arm and receives a reward in return</li>
<li>The agent makes updates to its stored information based on the reward</li>
</ol>
<p>The agents differ mainly in step 1, in how they choose the arm. Step 3 supports step 1 by updating the stored information, and is similar across most agents with some minor differences.</p>
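<p>The three steps above can be sketched as a generic interaction loop. The <code>FixedBandit</code> and <code>GreedyAgent</code> classes here are toy stand-ins I made up for illustration, not the repo’s implementations:</p>

```python
class FixedBandit:
    """Toy bandit that pays a fixed, deterministic reward per arm."""
    def __init__(self, rewards):
        self.rewards = rewards

    def pull(self, arm):
        return self.rewards[arm]

class GreedyAgent:
    """Toy agent that always exploits the best average reward so far."""
    def __init__(self, n_arms):
        self.counts = [0] * n_arms
        self.values = [0.0] * n_arms  # running average reward per arm

    def select_arm(self):
        return max(range(len(self.values)), key=lambda a: self.values[a])

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]

def run(agent, bandit, n_steps):
    """Generic loop: choose, pull, update; returns the reward history."""
    rewards = []
    for _ in range(n_steps):
        arm = agent.select_arm()   # 1. choose an arm from stored information
        reward = bandit.pull(arm)  # 2. pull it and receive a reward
        agent.update(arm, reward)  # 3. fold the reward into stored information
        rewards.append(reward)
    return rewards

history = run(GreedyAgent(2), FixedBandit([1.0, 0.0]), 100)
```

Swapping in a different <code>select_arm</code> implementation is all it takes to get the agents discussed below.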
<h3 id="evaluation-procedure">Evaluation Procedure</h3>
<p>In the following section, I compare agents with different parameters to each other by running an agent against a bandit for a pre-defined number of timesteps repeatedly. By doing this multiple times and tracking the rewards at each timestep, we can get a sense of what average performance we can expect from the agent at each timestep.</p>
<p>Naturally, we should see lower average rewards earlier on since we are still exploring and are uncertain of which arms provide the best value, but what we hope to see is a gradual increase in rewards until we identify the optimal arm, at which point the rewards should flatten out to the average of the optimal arm’s reward.</p>
<p>The two plots I include each with the comparisons track both of the metrics over time:</p>
<ol>
<li>Average reward at each timestep</li>
<li>Percent of times the agent chose the optimal arm at that timestep</li>
</ol>
<p>As you will see, the former can be a rather noisy chart (especially with gaussian reward functions), but the latter results in a smoother chart.</p>
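<p>Computing those two metrics from recorded run histories can be sketched as follows (the history format is my own assumption, not the repo’s actual interface):</p>

```python
def summarize(runs, best_arm):
    """Per-timestep average reward and optimal-arm rate over repeated runs.

    runs: one history per run, each a list of (chosen_arm, reward) pairs.
    """
    n_runs, n_steps = len(runs), len(runs[0])
    avg_reward = [sum(run[t][1] for run in runs) / n_runs for t in range(n_steps)]
    pct_optimal = [sum(run[t][0] == best_arm for run in runs) / n_runs for t in range(n_steps)]
    return avg_reward, pct_optimal
```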
<h2 id="agents">Agents</h2>
<h3 id="epsilon-greedy">Epsilon Greedy</h3>
<p>The epsilon greedy agent is defined by two parameters: epsilon and epsilon decay.</p>
<p>Every timestep, in order to select the arm to choose, the agent generates a random number between 0 and 1. If the value is below epsilon, then the agent selects a random arm. Otherwise, it chooses the arm with the highest average reward (breaking ties randomly), thus exploiting what it knows.</p>
<p>A higher epsilon results in more exploration (random arm selections), and a lower epsilon results in more exploitation.</p>
<p>Because we may not want to keep the same epsilon over the life of our problem, we introduce the epsilon decay parameter, which decreases the value of epsilon after each timestep. This naturally lends itself towards a high explore approach at the beginning when we are unsure of the arm rewards, and a high exploit approach later on once we have more information.</p>
<p>In theory, this seems like a good idea, but in practice (with noisy rewards), decaying epsilon seems to have slightly lower performance. However, I did not implement a minimum epsilon, which could help by preventing a fully-exploit scenario.</p>
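<p>The selection rule is only a few lines; a minimal sketch, assuming <code>values</code> holds the running average reward per arm:</p>

```python
import random

def select_arm_eps_greedy(values, epsilon):
    """With probability epsilon pick a random arm, else exploit the best average."""
    if random.random() < epsilon:
        return random.randrange(len(values))  # explore
    best = max(values)
    # exploit, breaking ties randomly
    return random.choice([a for a, v in enumerate(values) if v == best])

# epsilon decay: shrink epsilon multiplicatively after every timestep
epsilon, decay = 0.10, 0.9999
epsilon *= decay  # one timestep of decay
```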
<p>Below is a comparison of some different parameters of epsilon greedy agents:</p>
<p><img src="/images/eps_greedy_rewards.png" alt="" /></p>
<p><img src="/images/eps_greedy_optimal_arms.png" alt="" /></p>
<p>Here is a comparison of the best decay rate I found (ratio of 0.9999 per timestep) with different starting epsilon values.</p>
<p><img src="/images/eps_greedy_decay.png" alt="" /></p>
<h3 id="ucb">UCB</h3>
<p>The upper confidence bound (UCB) agent tracks the average reward for each arm, similar to epsilon greedy, but rather than encoding its exploration as a binary random chance, it attempts to measure uncertainty in terms of how long it has been since an arm was last chosen.</p>
<p>Each timestep, the agent chooses the arm with the highest average reward plus “uncertainty”, and the uncertainty for each arm not chosen increases a little bit.</p>
<p>Early on, every timestep where an arm is not chosen increases its uncertainty by a significant amount. As the system time grows, the uncertainty contributed by each timestep decreases, since we should have more accurate estimates of the rewards as time progresses.</p>
<p>An important note is that this uncertainty is not what we normally think of in statistics and is <strong>not related to the variance of the reward estimates</strong>.</p>
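<p>A common way to write this rule (the UCB1-style bonus; the exact parameterization in my repo may differ) scores each arm as its average reward plus <code>c * sqrt(ln(t) / n_pulls)</code>:</p>

```python
import math

def select_arm_ucb(values, counts, t, c):
    """Pick the arm with the highest average reward plus exploration bonus.

    The bonus grows with total time t and shrinks with an arm's pull count,
    so long-neglected arms eventually float back to the top.
    """
    def score(arm):
        if counts[arm] == 0:
            return float("inf")  # try every arm at least once
        return values[arm] + c * math.sqrt(math.log(t) / counts[arm])
    return max(range(len(values)), key=score)
```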
<p>The influence of the uncertainty factor is determined by a parameter C. Below is a comparison of some runs with different values of C:</p>
<p><img src="/images/ucb_pct_arms.png" alt="" /></p>
<p>One of the main purposes of this repo was to help visualize the UCB agent, in terms of how it balances the average rewards received so far and the uncertainty of unused arms.</p>
<p>Below is a gif of a UCB agent in action. Each frame in the gif is a step where the agent chose an action, received a reward, and updated its estimates/uncertainties for each arm.</p>
<p>The blue parts of each bar are the average rewards for that arm, and the orange parts are the uncertainty. You should be able to see the blue parts jump around as the highest total blue + orange arm is pulled, while the non-pulled arms’ orange parts should steadily increase until they become the highest bars.</p>
<p>At first, the values will most likely jump around more as the variance of the reward estimates is large, but as it progresses, it should settle into selecting a few arms repeatedly until there is one main winner.</p>
<p><img src="/images/ucb_race_gif.gif" alt="" /></p>
<h3 id="gradient-method">Gradient Method</h3>
<p>The prior two algorithms choose arms based on the average reward values, selecting the highest performing one (with some initial exploration). Gradient-based algorithms instead rely on relative preferences for each arm that do not necessarily correspond to actual reward values. At each timestep, the reward for an arm is observed, and then an incremental update to the existing preference score is made based on the new reward and a parameter alpha. This is similar to stochastic gradient ascent, and a larger alpha will result in a larger step size.</p>
<p>The details for updating the preference values \(H_{t}(a)\), given selection probabilities \(\pi_{t}(a)\), selected action \(A_{t}\), reward \(R_{t}\), and average reward \(\overline{R_{t}}\), are as follows:</p>
<p>\(H_{t+1}(A_{t}) = H_{t}(A_{t}) + \alpha (R_{t} - \overline{R_{t}})(1 - \pi_{t}(A_{t}))\) for action \(A_{t}\) and</p>
<p>\(H_{t+1}(a) = H_{t}(a) - \alpha (R_{t} - \overline{R_{t}})\pi_{t}(a)\) for other actions \(a \neq A_{t}\)</p>
<p>When choosing an arm, the agent passes these arm preferences through the softmax distribution to assign weights to all arms that add up to one. These weights are the probabilities that each arm is chosen. After each step, the average rewards are updated, then the weights for sampling are recalculated.</p>
<p>In case you aren’t familiar, the softmax distribution is as follows: \(P\{A_{t} = a\} = \frac{e^{H_{t}(a)}}{\sum_{b=1}^k e^{H_{t}(b)}}\)</p>
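<p>Both updates and the softmax can be sketched directly from the formulas above (variable names are my own):</p>

```python
import math

def softmax(prefs):
    """Turn preferences H(a) into selection probabilities pi(a)."""
    shifted = [h - max(prefs) for h in prefs]  # shift for numerical stability
    exps = [math.exp(h) for h in shifted]
    total = sum(exps)
    return [e / total for e in exps]

def update_preferences(prefs, chosen, reward, baseline, alpha):
    """One gradient-style preference update after observing a reward."""
    pi = softmax(prefs)
    return [
        h + alpha * (reward - baseline) * ((1.0 if a == chosen else 0.0) - pi[a])
        for a, h in enumerate(prefs)
    ]
```

The indicator term collapses the two update equations into one: the chosen action gets the \((1 - \pi)\) factor, all others get \((-\pi)\).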
<p>One thing to note is that the initial value of the average reward baseline, before any rewards have been observed, affects the results. Starting it at a value greater than zero still gives all arms an equal chance of being selected at first, but encourages more exploration in the short term before the selection probabilities of poorly performing arms are driven almost to zero.</p>
<p><img src="/images/gradient_pct_arms.png" alt="" /></p>
<h2 id="interactive-notebook">Interactive Notebook</h2>
<p>I created a <a href="https://github.com/mwburke/bandits">github repo</a> with all of the code used to generate these plots, with a <a href="https://github.com/mwburke/bandits/blob/master/walkthrough.ipynb">notebook</a> ready to re-run them and change any parameters, so you can get an intuition about how some of these common agent algorithms work.</p>
<p>I’d highly recommend playing around with different numbers of arms, bernoulli rewards, and various levels of noise in the gaussian rewards by increasing and decreasing the standard deviation compared to the means.</p>Matthew Burkematthew.wesley.burke@gmail.comBenchmark Comparisons and UCB VisualizationIntroduction to Idyll2018-12-04T00:00:00+00:002018-12-04T00:00:00+00:00https://mwburke.github.io/data%20visualization/2018/12/04/idyll-pumpkin-taste-test<p>Over Thanksgiving, some friends of mine set out to find the best pumpkin pie recipe and in the process, baked 5 different pies for comparison. After enjoying and ranking them, they decided to open the survey population to let others determine what the truly best pie was with a blind taste test. Being a data nerd himself, my friend tracked all of these responses and passed them on to me so that I could take a stab at visualizing them with a new interactive data visualization framework I had recently discovered.</p>
<h1 id="idyll"><a href="https://idyll-lang.org/">Idyll</a></h1>
<p><a href="https://idyll-lang.org/">Idyll</a> is, according to their website, “a toolkit for creating data-driven stories and explorable explanations” that makes it simple and quick to create interactive visualizations, and in my opinion, it’s the easiest tool out there to get involved with the communication medium of <a href="https://pudding.cool/process/responsive-scrollytelling/">“scrollytelling”</a>. The base <code class="language-plaintext highlighter-rouge">.idyll</code> file that renders into the final webpage is based on Markdown, but it has a few features that make it an extremely effective tool to prototype quickly but still support more advanced work.</p>
<h2 id="react-integration">React Integration</h2>
<p>One of the most powerful aspects is that it is integrated with React to enable the easy inclusion of pre-made components. It natively has support for a set of simple graphs generated from csv or json files. I wasn’t able to generate what I wanted with these, so I went ahead and added <a href="https://vega.github.io/vega-lite/">vega-lite</a> through npm and within a few minutes had a new chart from my existing data source.</p>
<p><img src="/images/idyll_intro_votes.png" alt="" /></p>
<p>Additionally, it’s fairly straightforward to take existing d3 visualizations, make a few minor modifications, wrap them in a React component and then embed them on your page. In my test, I included a <a href="https://en.wikipedia.org/wiki/Parallel_coordinates">parallel coordinates chart</a> taken directly from <a href="https://beta.observablehq.com/@jerdak/parallel-coordinates-d3-v4">an Observable notebook</a>, changed a few lines of CSS and had a working chart much faster than I expected.</p>
<p><img src="/images/idyll_parallel_coordinates.png" alt="" /></p>
<h2 id="property-management">Property Management</h2>
<p>The other fantastic feature of Idyll is the ability to create and manage variables whose properties you can both access in your different components as well as recalculate in real time based on user input.</p>
<p>For example, I can have a variable that can be modified from a variety of pre-made sources, including a button, slider, text input or scroll trigger, which can in turn update any visualizations on the page with the new properties. I don’t have to write any additional event listeners and can reuse these properties wherever I want to on the page. I didn’t leverage a ton of these features other than reusing some of my data files and visualization configuration parameters such as width/height, but the possibilities really are endless.</p>
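<p>A minimal sketch of what that looks like in Idyll’s markup, from memory of the docs (<code>var</code>, <code>Range</code> and <code>Display</code> are built-ins; feeding the variable into a chart component is illustrative):</p>

```
[var name:"chartWidth" value:300 /]

[Range value:chartWidth min:100 max:600 /]
The chart is currently [Display value:chartWidth /] pixels wide.
```

Dragging the slider updates <code>chartWidth</code>, and every component that reads it re-renders automatically.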
<h1 id="getting-started-with-idyll">Getting Started with Idyll</h1>
<p>If you would also like to get started with Idyll for your own projects, you can take a look at the <a href="/idyll-test-pumpkin/">full post I created with Idyll</a> and <a href="https://github.com/mwburke/idyll-test-pumpkin">the underlying code</a> to see how it was generated, and then head on over to Idyll’s <a href="https://idyll-lang.org/gallery">Example Gallery</a> page to see amazing work on how far you can take this framework.</p>
<h2 id="whats-the-catch">What’s the Catch?</h2>
<p>Tools always have tradeoffs, and Idyll embraces a markdown-like language that allows quick development. For more advanced visualizations and custom triggers, it may be worth choosing a more flexible but time-consuming framework to get the exact effects you want.</p>
<p>Additionally, the work only supports single-post rendering as of now, and the user has to create their own process for hosting multiple posts on a single website/platform. There are a few options out there trying to deal with this, but <a href="https://github.com/idyll-lang/idyll/issues/421">according to this github issue</a>, it looks like they’re beginning development to support this.</p>
<h2 id="further-reading">Further Reading</h2>
<p>Here are some great sites to understand the potential of what can really be done with interactive visualization for storytelling and data communication.</p>
<ul>
<li><a href="https://pudding.cool/">The Pudding</a></li>
<li><a href="https://fivethirtyeight.com/">FiveThirtyEight</a></li>
<li><a href="https://www.informationisbeautifulawards.com/news/118-the-nyt-s-best-data-visualizations-of-the-year">The NY Times</a></li>
</ul>Matthew Burkematthew.wesley.burke@gmail.comVisualizing Pumpkin Pie Taste Test ResultsProbability Calibration2018-11-26T00:00:00+00:002018-11-26T00:00:00+00:00https://mwburke.github.io/data%20science/2018/11/26/probability-calibration<h1 id="predictions-as-confidence">Predictions As Confidence</h1>
<p>As you may already know, classification problems in machine learning commonly (though not always) use algorithms that output a <em>predicted probability</em> value that can be used to gauge confidence in how sure your model is that the input belongs to one particular class.</p>
<h1 id="setting-probability-thresholds">Setting Probability Thresholds</h1>
<p>In introductory ML courses, a default value of 0.50 is usually used as the prediction cutoff for making the decision to consider a binary classification output as either positive or negative class, but in industry, selecting the right cutoff threshold is critical to making good business decisions.</p>
<p>If the cost associated with false negatives is large, it may be optimal to use a lower probability decision threshold to capture more positive users at the expense of including more false positives, and vice versa; the data scientist works with the business units to balance this tradeoff in order to minimize cost or maximize benefit. Ultimately, for applications that require binary classifications as the final output, this causes the predictions to act more as a ranking system than as values to be used directly.</p>
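<p>As a sketch of that tradeoff (the costs here are made up), one can scan candidate cutoffs and pick the one with the lowest total cost:</p>

```python
def best_threshold(probs, labels, cost_fp, cost_fn):
    """Return the cutoff that minimizes total false positive/negative cost."""
    candidates = sorted(set(probs)) + [1.01]  # 1.01 means "predict nobody positive"
    def cost(thresh):
        fp = sum(1 for p, y in zip(probs, labels) if p >= thresh and y == 0)
        fn = sum(1 for p, y in zip(probs, labels) if p < thresh and y == 1)
        return fp * cost_fp + fn * cost_fn
    return min(candidates, key=cost)

# raising cost_fn relative to cost_fp pushes the chosen threshold lower
cutoff = best_threshold([0.1, 0.4, 0.6, 0.9], [0, 0, 1, 1], cost_fp=1.0, cost_fn=5.0)
```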
<h1 id="interpretation-problems">Interpretation Problems</h1>
<h2 id="model-as-ranking">Model As Ranking</h2>
<p>This model-as-ranking system works fine in many situations, but what happens when your predicted probability does not actually represent the true probability, yet the business unit consuming your predictions assumes that it does? An example of this might be the likelihood of conversion for a given user, which is then multiplied by potential LTV to prioritize leads for a sales organization based on expected ROI. If a model tends to over/underestimate probabilities at the lower/upper ends of the predicted probability spectrum respectively (as random forest models have been known to do), you can end up spending effort on individuals who are less worth the team’s time, wasting resources and potentially losing revenue.</p>
<p>Scikit-learn has a great overview of some common algorithms that produce biased predicted probabilities. I’ve taken the liberty of displaying the chart from that overview here. Visit <a href="https://scikit-learn.org/stable/auto_examples/calibration/plot_compare_calibration.html">this link</a> to get the full code used to generate the plot, or just look at the documentation for the <a href="https://scikit-learn.org/stable/modules/generated/sklearn.calibration.calibration_curve.html">sklearn.calibration.calibration_curve</a> function.</p>
<p><img src="/images/calibration_curve_1.png" alt="https://scikit-learn.org/stable/auto_examples/calibration/plot_compare_calibration.html" /></p>
<h2 id="parallel-model-consumption">Parallel Model Consumption</h2>
<p>Additionally, models can be used in conjunction with one another to provide targets in context. Going back to our expected LTV example, a business may have separate conversion likelihood models for different segments of their customer population, with every user being assigned a conversion probability from a single model. If not all models produce well-calibrated predicted probabilities, one could end up dominating the others while still having good metrics when considered individually.</p>
<h3 id="auroc-can-be-misleading">AUROC Can Be Misleading</h3>
<p>One common performance metric that is used to measure the effectiveness of the model across the range of predicted probabilities is the area under the receiver operating characteristic (ROC) curve. In case you aren’t familiar with the ROC curve, it is a plot of the model’s true positive rate vs the false positive rate as the decision threshold is varied from 0 to 1, and as such, it is considered a more robust metric than accuracy alone in cases where classes are imbalanced or the costs of true/false positives are not yet known.</p>
<p>While it is a good metric, it is <strong>not</strong> sensitive to the absolute value of the predicted probabilities, only to the relative ordering of predictions. If all of the predicted probabilities are multiplied by a constant, the AUROC does not change, which may mislead the modeler into believing their probabilities are safe to use when, in fact, they consistently over/underestimate the true values.</p>
<p>For example, the three predicted probability density distributions below are just scaled versions of the output from the same model. Their distributions are obviously very different from one another, but because they are scaled by a constant, they all have an equivalent AUROC score.</p>
<p><img src="/images/pred_probs_scaled.png" alt="" /></p>
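<p>This invariance is easy to verify with a small rank-based AUROC implementation (a sketch, not the code used for the plot above):</p>

```python
def auroc(scores, labels):
    """AUROC as P(score_pos > score_neg), counting ties as half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

probs = [0.1, 0.3, 0.6, 0.8]
labels = [0, 0, 1, 1]
scaled = [0.5 * p for p in probs]  # same ranking, very different calibration
assert auroc(probs, labels) == auroc(scaled, labels)
```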
<h1 id="validation-with-additional-scoring-methods">Validation with Additional Scoring Methods</h1>
<p>As with most modeling, it’s impossible to represent overall performance with a single number, and if you have concerns about validating probability calibration, it seems wise to include additional scores alongside AUROC that are more representative of actual differences in calibration such as log loss or the brier score.</p>
<p>Log loss is a common loss function, but <a href="https://en.wikipedia.org/wiki/Brier_score">brier score</a> was new to me, and according to wikipedia “can be thought of as… a measure of the ‘calibration’ of a set of probabilistic predictions”. It essentially is the average squared difference between the probability that was forecast and the actual outcome of the event. This makes its interpretation analogous to the RMSE for regression problems, and does take into account the scale of the predictions.</p>
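<p>Since the brier score is just a mean squared difference, it fits in one line (binary outcomes assumed):</p>

```python
def brier_score(probs, outcomes):
    """Mean squared difference between forecast probabilities and outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)

brier_score([1.0, 0.0], [1, 0])  # perfect, confident forecasts score 0.0
brier_score([0.5, 0.5], [1, 0])  # hedged forecasts score 0.25
```

Unlike AUROC, multiplying the probabilities by a constant changes this score, which is exactly why it is useful for checking calibration.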
<h1 id="sampling-bias">Sampling Bias</h1>
<p>Many problems have imbalanced datasets in terms of the target variable, with a significant portion of the records belonging to one class. Various techniques have been developed to counteract this problem, including oversampling the minority class, downsampling the majority class, and generating synthetic samples from the minority class to more closely achieve class parity. However, these techniques can result in increased AUROC scores while biasing the predicted probabilities to be less calibrated to actual outcomes.</p>
<p>Here is an example of how a generally well calibrated classifier (Logistic Regression) can be biased depending upon the ratio of the positive to negative class in the training dataset:</p>
<p><img src="/images/calibration_curve_2.png" alt="" /></p>
<h2 id="further-research">Further Research</h2>
<p>Scikit-learn has implemented the <a href="https://scikit-learn.org/stable/modules/generated/sklearn.calibration.CalibratedClassifierCV.html">CalibratedClassifierCV</a> class to adjust your classifiers to be more calibrated either during training, or to adjust the predictions by calibrating the classifier post-training.</p>
<p>It has two options for doing so:</p>
<ul>
<li><a href="https://en.wikipedia.org/wiki/Platt_scaling">Platt Scaling</a></li>
<li><a href="https://en.wikipedia.org/wiki/Isotonic_regression">Isotonic Regression</a></li>
</ul>Matthew Burkematthew.wesley.burke@gmail.comPredictions As Actual ProbabilitiesMy Intro to Generative Art2018-07-09T00:00:00+00:002018-07-09T00:00:00+00:00https://mwburke.github.io/generative%20art/2018/07/09/generative-art-p5js<h1 id="what-is-generative-art">What is generative art?</h1>
<p>Generative art is procedurally generated art for those of us who are less traditionally artistically inclined. More specifically, those who have no skill but still have enough appreciation for art and mathematical principles to automate the creation of things that look nice.</p>
<h1 id="javascript-libraries">Javascript Libraries</h1>
<p>The go-to library for web-based mathematical visualization is <a href="https://d3js.org">d3</a>, and many visualization libraries are based upon it. However, recently I stumbled across <a href="https://processing.org/">processing</a> and its JavaScript equivalent <a href="https://p5js.org/">p5js</a>, which are amazing for creating procedurally generated visualizations. It’s inherently built to support an initialization process with the <code class="language-plaintext highlighter-rouge">setup</code> function and a function to update the visualization frame-by-frame with the <code class="language-plaintext highlighter-rouge">draw</code> function.</p>
<p>It’s easy to pick up and create something really quickly; it’s only been a few days since I first heard about it, and I’ve already had tons of fun learning the API and using the basics to create some “art” I’m happy with. I highly encourage you to check it out, and maybe take a look at some of the stuff I’ve made recently <a href="https://mwburke.github.io/generative-art">at my interactive generative art website</a>.</p>
<p>If you are too lazy to click on the link, here’s a few examples of the static visualizations as well as the one in the post header.</p>
<p><a href="http://worksofchart.com/generative-art/posts/002.html"><img src="/images/generative-art-2.png" alt="" /></a></p>
<p><a href="https://worksofchart.com/generative-art/posts/010.html"><img src="/images/generative-art-10.png" alt="" /></a></p>
<p><a href="https://worksofchart.com/generative-art/posts/011.html"><img src="/images/generative-art-11.png" alt="" /></a></p>Matthew Burkematthew.wesley.burke@gmail.comIn-Browser Art with p5.js