Jekyll2023-09-29T23:04:04+00:00https://mwburke.github.io/atom.xmlMatthew Burke’s BlogData science, generative art and other stuffMatthew Burkematthew.wesley.burke@gmail.comDistribution Drift Tolerance2023-09-28T00:00:00+00:002023-09-28T00:00:00+00:00https://mwburke.github.io/data%20science/2023/09/28/distribution-drift-tolerance<h2 id="does-psi-have-reliable-thresholds">Does PSI Have Reliable Thresholds?</h2>
<p>I’ve written before about the use of <a href="/data%20science/2018/04/29/population-stability-index.html">PSI (population stability index)</a> to measure population drift and provided some guidelines on interpreting scores, taken from <a href="https://www.lexjansen.com/wuss/2017/47_Final_Paper_PDF.pdf">this resource</a>. I recently began a project involving model monitoring and started to question how the authors of that paper derived their thresholds for slight, minor and significant changes; so far I have found no answers. It seems like a huge gap that an entire industry with science in its name is following a standard that has no formal basis.</p>
<p>To remedy this, I have created a <a href="https://github.com/mwburke/drift-tolerance/blob/main/drift_tolerance.ipynb">jupyter notebook</a> that allows the user to run through concrete examples of comparing distributions, evaluating them for drift, and creating a decision boundary representing the summary of their preferences.</p>
<h2 id="data-generation">Data Generation</h2>
<p>In order to provide distributions to compare, I chose a skewed normal as a familiar but not quite standard distribution that DS/ML practitioners may encounter in their jobs. I first generate random parameters for the skew, center and scale of the base distribution, and then modify each of those parameters by a random value within ±25% of the original. This gives us a base distribution and an alternate distribution, which we treat as the original population and the same population later in time.</p>
<p>We then sample each distribution a number of times to create our final histograms that will be reviewed by the user. For each set of samples, we can calculate the true PSI value and store it until the user has time to evaluate.</p>
<h2 id="evaluation-loop">Evaluation Loop</h2>
<p>When using the notebook, a single cell will show the plotted distributions against one another, but the true PSI value is not shown.</p>
<p><img src="/images/drift_distributions.PNG" alt="" /></p>
<p>The user must then click on one of two buttons to label the two populations as having acceptable or unacceptable drift. After clicking, their evaluation and corresponding PSI value are logged to a list, and the process repeats by replacing the plot with two new distributions and their hidden PSI value.</p>
<p>The user can continue this process as many times as they like until they feel satisfied that they have provided enough sample data to work with.</p>
<h2 id="preference-estimation">Preference Estimation</h2>
<p>What we have done here is create a labeled dataset that encodes the user’s intuition about how much difference between two distributions constitutes an unacceptable risk in terms of population drift. We can then fit a logistic regression model to quantitatively measure that relationship. I chose PyMC to build the model because of the small data sizes and noisiness in measurements (more below), but you could use anything. Once this model has been learned, we can plot our observations alongside the curve from our learned beta parameter to compare our expectations about risk to the measured reality. You can take a look at my results in the following chart:</p>
<p><img src="/images/drift_logistic.PNG" alt="" /></p>
<p>As we can see, this doesn’t look like your usual S curve. If we zoomed out significantly, we would see the shape, but as is, due to overlap between our labels, our model doesn’t have the nice, clear-cut decision boundary we were hoping for.</p>
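<p>If you don’t want the full PyMC treatment, here’s a dependency-light sketch of the same idea with scikit-learn (the PSI values and labels below are made up for illustration):</p>

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical logged session: observed PSI values and the user's labels,
# where 1 = "unacceptable drift" (these numbers are made up)
psi_vals = np.array([0.02, 0.05, 0.08, 0.10, 0.12, 0.15, 0.20, 0.25, 0.30, 0.40])
labels = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1])

clf = LogisticRegression().fit(psi_vals.reshape(-1, 1), labels)

# The 50% decision boundary, i.e. the PSI where beta0 + beta1 * psi = 0,
# is one way to read off a single-number personal drift tolerance
threshold = -clf.intercept_[0] / clf.coef_[0][0]
print(round(threshold, 3))
```

<p>The overlap between the labels is exactly what flattens the fitted slope, so the 50% crossing becomes the most useful single-number summary of your tolerance.</p>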
<h2 id="conclusions">Conclusions</h2>
<ol>
<li>Drift tolerance is subjective from person to person, and may even vary within an individual. Humans are bad at probability and at most visual mathematical intuition beyond a few data points.</li>
<li>Reliance on industry benchmarks without appropriate supporting data can leave you and your organization vulnerable to hidden mistakes and overconfidence.</li>
<li>Folks working in the data industry need to challenge assumptions, and if supporting datasets do not exist, seek them out or curate them themselves.</li>
</ol>
<p>I can imagine that going through this exercise with key members of your model validation/risk management leadership may prove insightful, and if you want to get even fancier, it might be fun to build a hierarchical logistic model to capture the risk preference of the organization overall as well as being able to compare individuals’ preferences.</p>
<h2 id="resources">Resources</h2>
<ul>
<li><a href="https://github.com/mwburke/drift-tolerance">Drift Tolerance Repo</a></li>
</ul>MatthewIdentifying your personal preference for measuring population driftPyMC Wrapper2023-02-23T00:00:00+00:002023-02-23T00:00:00+00:00https://mwburke.github.io/data%20science/2023/02/23/pymc-wrapper<h2 id="overview">Overview</h2>
<p>Bayesian modeling can be super valuable for capturing uncertainty and leveraging prior distributions for new products/geos/etc. I’m relatively new to the space, but in my role as a machine learning engineer, I found the tools to be very focused on the science and less so on deployment. While the science part is absolutely critical and must be handled thoughtfully and methodically, there are different sets of concerns when managing a suite of models in production.</p>
<p>I built a quick POC python library called <a href="https://github.com/mwburke/pymc-wrapper">pymc-wrapper</a> to align PyMC’s amazing Bayesian modeling capabilities with the ease of use of scikit-learn’s fit/predict paradigm. There certainly is more work to do to build out a robust system, but I think for simpler modeling projects that require loading, saving and prediction across a number of models, this package could simplify workflows.</p>
<h3 id="assumptions">Assumptions</h3>
<p>Because the goal wasn’t to capture all cases and mostly to prove out the idea, I incorporated several assumptions:</p>
<ol>
<li>There is a single function that relates all independent variables to our dependent variable</li>
<li>The errors are normally distributed</li>
<li>The model is non-hierarchical</li>
<li>All data preprocessing is handled beforehand</li>
</ol>
<h2 id="configuration-driven">Configuration Driven</h2>
<p>The main idea behind the package is to define your independent variables and PyMC distributions beforehand in a config file and leave the model generation and sampling hidden within the other functions. The configuration would be defined as follows:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">independent_vars</code>: a list of the names of each independent variable</li>
<li><code class="language-plaintext highlighter-rouge">sample_params</code>: a dictionary of parameters to be passed into the pm.sample function</li>
<li><code class="language-plaintext highlighter-rouge">variable_params</code>: a dictionary of PyMC variables to be used in the model
<ul>
<li><code class="language-plaintext highlighter-rouge">variable_dict</code>: a dictionary defining a single variable, with its name as the key under <code class="language-plaintext highlighter-rouge">variable_params</code>
<ul>
<li><code class="language-plaintext highlighter-rouge">dist</code>: str of the name of the PyMC distribution (case sensitive) to be used for the variable</li>
<li><code class="language-plaintext highlighter-rouge">params</code>: a dictionary of the parameters for the variable and the values to be used as
the model’s priors</li>
</ul>
</li>
</ul>
</li>
<li><code class="language-plaintext highlighter-rouge">function_params</code>: a dictionary of function parameters
<ul>
<li><code class="language-plaintext highlighter-rouge">function</code>: a python function defining the relationship of independent variables
to the dependent variable</li>
</ul>
</li>
</ul>
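<p>To make that concrete, here’s a hypothetical sketch (not the package’s actual code) of what a loaded config might look like in memory, plus a minimal structural check before model construction; the <code class="language-plaintext highlighter-rouge">dist</code> strings would then be resolved to PyMC distribution classes by name, e.g. via <code class="language-plaintext highlighter-rouge">getattr</code>:</p>

```python
# A hypothetical in-memory version of such a config (the real one would be
# loaded from YAML with something like yaml.safe_load)
config = {
    "independent_vars": ["week"],
    "variable_params": {
        "lambda_val": {"dist": "HalfNormal", "params": {"sigma": 1}},
        "intercept": {"dist": "Normal", "params": {"mu": 0, "sigma": 1}},
    },
    "sample_params": {"draws": 1000},
}

def validate_config(config):
    # Minimal structural checks before handing the config to model construction
    assert config["independent_vars"], "need at least one independent variable"
    for name, spec in config["variable_params"].items():
        assert "dist" in spec and "params" in spec, f"{name} missing dist/params"
    return True

validate_config(config)
```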
<h3 id="example-configuration">Example Configuration</h3>
<p>Here is an example config file stored in YAML that could be used to generate a model that attempts to learn product adoption rates according to the negative exponential function.</p>
<p>The independent variable would be the <code class="language-plaintext highlighter-rouge">week</code>, and we use both an <code class="language-plaintext highlighter-rouge">intercept</code> and a <code class="language-plaintext highlighter-rouge">lambda_val</code> parameter (the latter chosen to not clash with the reserved <code class="language-plaintext highlighter-rouge">lambda</code> keyword in python) to generate our final outcome.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">negative_exponential</span><span class="p">(</span><span class="n">week</span><span class="p">,</span> <span class="n">lambda_val</span><span class="p">,</span> <span class="n">intercept</span><span class="p">):</span>
<span class="k">return</span> <span class="mi">1</span> <span class="o">-</span> <span class="n">intercept</span> <span class="o">-</span> <span class="p">(</span><span class="mi">1</span> <span class="o">-</span> <span class="n">intercept</span><span class="p">)</span> <span class="o">*</span> <span class="n">np</span><span class="p">.</span><span class="n">exp</span><span class="p">(</span><span class="o">-</span><span class="n">lambda_val</span> <span class="o">*</span> <span class="n">week</span><span class="p">)</span>
</code></pre></div></div>
<p>The configuration reflects this by naming our independent variable and defining our parameters with corresponding <code class="language-plaintext highlighter-rouge">pymc.distribution</code> function names and prior values. We can also set our sampling parameters as well.</p>
<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">independent_vars</span><span class="pi">:</span>
<span class="pi">-</span> <span class="s">week</span>
<span class="na">variable_params</span><span class="pi">:</span>
<span class="na">lambda_val</span><span class="pi">:</span>
<span class="na">params</span><span class="pi">:</span>
<span class="na">sigma</span><span class="pi">:</span> <span class="m">1</span>
<span class="na">dist</span><span class="pi">:</span> <span class="s">HalfNormal</span>
<span class="na">intercept</span><span class="pi">:</span>
<span class="na">params</span><span class="pi">:</span>
<span class="na">mu</span><span class="pi">:</span> <span class="m">0</span>
<span class="na">sigma</span><span class="pi">:</span> <span class="m">1</span>
<span class="na">dist</span><span class="pi">:</span> <span class="s">Normal</span>
<span class="na">sample_params</span><span class="pi">:</span>
<span class="na">draws</span><span class="pi">:</span> <span class="m">1000</span>
</code></pre></div></div>
<p>To simplify things, I kept the function definition out of the config and manually set it afterwards with</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">config</span><span class="p">[</span><span class="s">'function_params'</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span>
<span class="s">'function'</span><span class="p">:</span> <span class="n">negative_exponential</span>
<span class="p">}</span>
</code></pre></div></div>
<h2 id="model-usage">Model Usage</h2>
<h3 id="creation-and-training">Creation and Training</h3>
<p>Once we define and load our configuration from a file into a dictionary, creating a model wrapper object and training is as easy as the following:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">model</span> <span class="o">=</span> <span class="n">PymcModel</span><span class="p">(</span><span class="n">config</span><span class="p">)</span>
<span class="n">X</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="n">independent_vars</span><span class="p">]</span>
<span class="n">Y</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="n">dependent_var</span><span class="p">]</span>
<span class="n">model</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">Y</span><span class="p">)</span>
</code></pre></div></div>
<h3 id="prediction">Prediction</h3>
<p>Prediction is easy as well, and similar to above, all we have to do is prepare our test data and pass it to the <code class="language-plaintext highlighter-rouge">predict</code> function as we would for many other model libraries. Below is an example of training on a set of curves and predicting on the same timeline to ensure that we have learned the relationship correctly:</p>
<p><img src="/images/pymc_wrapper_learned_comparison.png" alt="" /></p>
<p>Because (IMO) the point of using Bayesian models is to take advantage of their ability to capture uncertainty, I augmented the <code class="language-plaintext highlighter-rouge">predict</code> function to take in an <code class="language-plaintext highlighter-rouge">alpha</code> parameter, which is used to generate credible intervals that are output alongside the median predictions. Below is an example using an <code class="language-plaintext highlighter-rouge">alpha</code> of 0.9 with the same dataset as before:</p>
<p><img src="/images/pymc_wrapper_credible_interval.png" alt="" /></p>
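<p>Under the hood, an equal-tailed credible interval like this can be read straight off the posterior predictive draws; here’s a minimal numpy sketch (the array shapes and names are illustrative, not the package’s actual internals):</p>

```python
import numpy as np

def credible_interval(posterior_draws, alpha=0.9):
    # posterior_draws: (n_draws, n_points) array of predicted curves,
    # one row per posterior sample
    lower_q = (1 - alpha) / 2
    lower = np.quantile(posterior_draws, lower_q, axis=0)
    median = np.quantile(posterior_draws, 0.5, axis=0)
    upper = np.quantile(posterior_draws, 1 - lower_q, axis=0)
    return lower, median, upper

# Fake posterior draws standing in for real sampler output
draws = np.random.default_rng(0).normal(loc=0.5, scale=0.1, size=(4000, 10))
lo, med, hi = credible_interval(draws, alpha=0.9)
```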
<p>It’s as easy as that!</p>
<h3 id="savingloading">Saving/Loading</h3>
<p>Like other ML libraries, we often want to save our model for prediction at another time, and I have implemented <code class="language-plaintext highlighter-rouge">save_trace</code> and <code class="language-plaintext highlighter-rouge">load_trace</code> functions accordingly to facilitate this. As long as we reference the same config file for model creation, we can load a saved trace and get straight to prediction.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">trace_file_path</span> <span class="o">=</span> <span class="s">'trace.pkl'</span>
<span class="n">model</span><span class="p">.</span><span class="n">save_trace</span><span class="p">(</span><span class="n">trace_file_path</span><span class="p">)</span>
<span class="n">new_model</span> <span class="o">=</span> <span class="n">PymcModel</span><span class="p">(</span><span class="n">config</span><span class="p">)</span>
<span class="n">new_model</span><span class="p">.</span><span class="n">load_trace</span><span class="p">(</span><span class="n">trace_file_path</span><span class="p">)</span>
<span class="n">new_model</span><span class="p">.</span><span class="n">predict</span><span class="p">(</span><span class="n">X_test</span><span class="p">)</span>
</code></pre></div></div>
<h3 id="creating-config-with-posteriors-as-priors">Creating Config with Posteriors as Priors</h3>
<p>One potential use case I’ve found interesting is to learn from an established product/geo and apply the posteriors as priors for a new area. Rather than manually updating configs, I added an <code class="language-plaintext highlighter-rouge">export_trained_config</code> function that updates the priors in the original config with the posterior means and saves it to an output file. Similar to above, we could consume this updated config file to create a new model object.</p>
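<p>The core of such an export is just a config transformation. Here’s a hypothetical sketch of the idea (the helper body and the posterior summary values are made up for illustration):</p>

```python
import copy

# Hypothetical original config and posterior summary (e.g. trace means)
config = {
    "variable_params": {
        "lambda_val": {"dist": "HalfNormal", "params": {"sigma": 1}},
        "intercept": {"dist": "Normal", "params": {"mu": 0, "sigma": 1}},
    }
}
posterior_summary = {"lambda_val": {"sigma": 0.4},
                     "intercept": {"mu": 0.05, "sigma": 0.2}}

def export_trained_config(config, posterior_summary):
    # Overwrite each variable's prior params with the learned posterior
    # values, leaving the original config untouched
    updated = copy.deepcopy(config)
    for name, params in posterior_summary.items():
        updated["variable_params"][name]["params"].update(params)
    return updated

new_config = export_trained_config(config, posterior_summary)
```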
<h3 id="periodic-retraining">Periodic Retraining</h3>
<p>One last use case I thought might be interesting to showcase is the ability to call the <code class="language-plaintext highlighter-rouge">fit</code> function on multiple sets of data. In the following example, I fitted a model to the adoption rate curves we saw above, exported the trained config file, created a new model with it as priors, and finally fit that model repeatedly on data in 4-week chunks to see how the predicted curve developed over time. The performance at the beginning isn’t great, as it overpredicts the curve’s steepness at first, but given the noise early on, that’s not totally unreasonable. Either way, it definitely gets more confident over time as it gains more observed data, and it stabilizes after a few months.</p>
<p><img src="/images/pymc_wrapper_monthly_update.gif" alt="" /></p>
<h2 id="feedback">Feedback</h2>
<p>While this is definitely a work in progress, I would love for you to check out <a href="https://github.com/mwburke/pymc-wrapper/tree/main">the repo</a> and the <a href="https://github.com/mwburke/pymc-wrapper/blob/main/example/example_walkthrough.ipynb">example walkthrough notebook</a> that I used to generate these plots and examples.</p>
<p>If you have any feedback on things you would change or add, or even thoughts on whether something like this is needed at all, I would love to hear from you at <a href="mailto:matthew.wesley.burke@gmail.com">my email</a> or on github in issues/PRs!</p>Matthew Burkematthew.wesley.burke@gmail.comMaking PyMC models more accessible and reusableMMM: Miracle of Marketing Measurement or Misleading Modeling Methodology?2023-01-31T00:00:00+00:002023-01-31T00:00:00+00:00https://mwburke.github.io/data%20science/2023/01/31/mmm-future-or-liability<h1 id="mmm-miracle-of-marketing-measurement-or-misleading--modeling-methodology">MMM: Miracle of Marketing Measurement or Misleading Modeling Methodology?</h1>
<h2 id="tldr">TL;DR</h2>
<p>MMMs are a future-proofed tool for measuring marketing effectiveness in a world of increased online privacy, but they can be prone to multiple silent failure modes that produce inaccurate results.</p>
<h2 id="what-is-an-mmm">What is an MMM?</h2>
<p>Media mix modeling is a statistical technique used to understand the effectiveness of different advertising and marketing channels, such as television, print, digital, and so on. The goal of media mix modeling is to determine the optimal allocation of a company’s advertising budget across different channels to maximize the return on investment (ROI). This is done by analyzing historical data on marketing expenditures and conversions, and using statistical models to estimate the incremental impact of different advertising channels on conversions.</p>
<p>Businesses use media mix modeling to gain insights into the effectiveness of their advertising campaigns and to make data-driven decisions about where to allocate their advertising budget. By understanding the ROI of different advertising channels, companies can optimize their marketing strategy to maximize their sales and revenue. Additionally, media mix modeling can help businesses identify underperforming channels and adjust their advertising strategy accordingly. It can also help them understand changes in the market and the effectiveness of their strategy.</p>
<h2 id="privacy-proof">Privacy Proof?</h2>
<p>If you haven’t noticed the incessant cookie tracking messages on every website, the world is trending toward increased online data privacy due to legislation such as <a href="https://gdpr-info.eu/">GDPR</a> and <a href="https://oag.ca.gov/privacy/ccpa">CCPA</a>, as well as Apple’s <a href="https://developer.apple.com/app-store/user-privacy-and-data-use/">iOS 14</a> privacy changes. What this means for advertisers is that they will receive less and less impression-level user data. While your company may still be able to extract aggregate-level insights from <a href="https://www.appsflyer.com/resources/guides/data-clean-rooms/">data clean rooms</a>, mapping conversions to specific users will no longer be possible. The most obvious impact is removing the possibility of maintaining a <a href="https://segment.com/academy/advanced-analytics/an-introduction-to-multi-touch-attribution/">multi-touch attribution</a> pipeline to assign conversion credit to user impressions.</p>
<h3 id="evergreen-data">Evergreen Data</h3>
<p>MMMs, however, don’t rely on user-level data; instead they rely on first-party data that companies will always have: namely, what money went where. Marketing teams will always have budgets to track which advertising dollars were spent on which channels and campaigns, as well as conversions tracked to report revenue. These are the core data fed into an MMM, and as such, I don’t see them disappearing at any point in the future. Impressions are sometimes preferred over marketing spend as model inputs because they smooth out fluctuations in advertising CPA, but my guess is that advertisers will continue to provide at least a high-level estimate of impression magnitude to keep their customers happy.</p>
<h2 id="the-pitfalls">The Pitfalls</h2>
<p>With our options for measurement dwindling, why be pessimistic about MMM? It seems like the one stable methodology that will see widespread use. The main reasons revolve around the fact that it’s a small data problem with a big emphasis on slow but precise data collection, and the potential for silent methodology errors.</p>
<h3 id="lack-of-quality-data-input">Lack of Quality Data Input</h3>
<h4 id="not-enough-data">Not Enough Data</h4>
<p>Data collection can’t be rushed or automated the way labeled training data can; it accumulates naturally over time as your company markets its products or services. For some digital channels/advertisers, you may be able to collect both spend and impressions data down to a daily basis, whereas for traditional non-digital channels such as mail or print, you may be limited to a monthly cadence.</p>
<p>Given that your data inputs must be fed into the model at the least granular level, your company may take years to gain enough data to reliably estimate performance. Assume, for example, that you want to estimate the carryover and saturation effects at a channel level; depending on the functional forms you use to model them, this could mean 2-5 parameters for each channel. If you have 10+ channels, you can see how it will take a lengthy amount of time until you even have more data points than parameters, let alone enough to reduce uncertainty to a reasonable level.</p>
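<p>The arithmetic is sobering even in a toy setup (the parameter counts below are illustrative assumptions):</p>

```python
channels = 10
params_per_channel = 3   # e.g. effect size, carryover decay, saturation shape
control_params = 5       # trend, seasonality, holidays, etc. (illustrative)
total_params = channels * params_per_channel + control_params  # 35

# If the least granular channel reports monthly, each month adds one data
# point, so it takes total_params months (~3 years) just to reach n = p
print(total_params, round(total_params / 12, 1))
```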
<p>New channels equally struggle with this problem of small data and extracting seasonal effects. Often, one solution for this is to perform additional testing and apply the derived assumptions rather than inferring parameters.</p>
<h4 id="not-enough-data-variance">Not Enough Data Variance</h4>
<p>One rookie mistake new marketing divisions may make is to spend proportionally on channels month over month. Although the total spend in a time period may shift, if you spend the same proportions in each channel over time, you won’t have enough variance between channels and will end up with an entirely collinear dataset. Any model building after that will run into serious, unresolvable issues. This is a classic example of the explore/exploit tradeoff, where you need to add enough randomness to discern between channels while still maximizing profit. Fortunately, there may be natural barriers through ad planning and payment that prevent this issue entirely, but it’s something for a central planning team to monitor.</p>
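<p>A tiny example shows why fixed proportions are fatal: if every channel is a constant fraction of total spend, the channel series are perfectly correlated and no model can separate their effects (the spend numbers are made up):</p>

```python
import numpy as np

total = np.array([100.0, 120.0, 90.0, 150.0, 110.0])  # monthly total spend
tv, search = 0.6 * total, 0.4 * total  # fixed 60/40 split every month
corr = np.corrcoef(tv, search)[0, 1]   # ~1.0: perfectly collinear channels
print(round(corr, 6))
```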
<h4 id="missing-covariates">Missing Covariates</h4>
<p>Marketing spend only accounts for a portion of all conversions, with the remaining conversions attributed to a number of control factors. These could include natural demand, brand recognition, economic factors, total available market, competitor advertising/brand effects, etc. Without knowing beforehand whether they correlate with your target, it can be difficult to decide whether or not to invest in data collection or purchase. Sometimes known factors may be unavailable due to timing granularity or proprietary access. From my point of view, this is the primary reason using a third-party vendor to build an MMM is valuable: their expertise and data collection infrastructure may overcome some of these limitations and augment your data with crucial factors.</p>
<h3 id="misinterpretation-and-misspecification">Misinterpretation and Misspecification</h3>
<p>MMMs have been around for a long time, and the amount of domain expertise developed has led to a few common solutions being preferred. The technical implementation of those has become almost trivial, and, in my opinion, the design of your model is entirely based around the set of assumptions you apply. This is advantageous if you fully understand the problem space, but without appropriate experience or data access, a lot of these assumptions will have to be made in an ad hoc manner or on gut feel.</p>
<h4 id="domain-assumptions">Domain Assumptions</h4>
<p>Before delving into the modeling process, you have to specify a number of domain-specific assumptions, including whether channel advertising effects are additive or multiplicative and what functional forms the saturation curves and carryover lag effects should take. It would take extensive time and resources to actually test and validate all of these assumptions, and unless you’re a large marketing firm providing MMM as a service, it’s unlikely you will ever be able to validate your choices. In this case, you are accepting the risk of building a model that doesn’t correspond to reality and will always be wrong by an unmeasurable amount, but from what I’ve seen when reading guides and resources online, this risk isn’t highlighted nearly enough.</p>
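<p>For concreteness, here’s what two of the most common assumption choices, geometric carryover (adstock) and Hill-type saturation, might look like in numpy. The decay and half-saturation values are illustrative assumptions, and these are just two of many possible functional forms:</p>

```python
import numpy as np

def geometric_adstock(spend, decay=0.5):
    # Carryover: each period retains `decay` of the previous adstocked value
    out = np.empty(len(spend))
    carry = 0.0
    for t, x in enumerate(spend):
        carry = x + decay * carry
        out[t] = carry
    return out

def hill_saturation(x, half_sat=100.0, shape=1.0):
    # Diminishing returns: response approaches 1 as adstocked spend grows
    x = np.asarray(x, dtype=float)
    return x**shape / (half_sat**shape + x**shape)

spend = np.array([0.0, 100.0, 100.0, 0.0, 0.0])
adstocked = geometric_adstock(spend, decay=0.5)  # spend keeps echoing forward
response = hill_saturation(adstocked, half_sat=100.0)
```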
<h4 id="forecasting-assumptions">Forecasting Assumptions</h4>
<p>Well-specified MMMs take into account non-media control variables to isolate incremental conversions from marketing spend from all other effects. In my experience, planning based on MMM results takes into account only the incremental marketing ROI measurements and fails to consider all of the control variables used as well. However, just as in life, no decision is still a decision, and not taking any changes into account is close to extrapolating the current situation forwards. I’m not necessarily saying that trying to predict future economic conditions given the volatility of Covid, inflation, recession (?), and interest rates is advisable or even possible, but focusing entirely on media effects is just another way to have silent, unmeasurable errors.</p>
<h4 id="understanding-time-period-application">Understanding Time Period Application</h4>
<p>Most MMMs until now have tended to use all historical time periods in the modeling process; to update one, data since the last run is collected and appended to the full dataset. The outputs are analyzed and used to optimize marketing spend for an upcoming period, commonly a few months out. The fact that we learn only a single set of parameters for each channel over the whole time period is highly concerning to me, given that we know businesses shift their strategies, advertising effectiveness and scale (hopefully) over time. Yes, if the curve were static, we would be gaining more data points on the curve and decreasing the uncertainty in our estimate, but assuming that all channels’ parameters have remained constant over months or years seems unreasonable to me.</p>
<p>There seem to be two main methods of combatting this issue. The first is <a href="https://stats.stackexchange.com/questions/454415/how-to-account-for-the-recency-of-the-observations-in-a-regression-problem">recency weighting</a>, a method of giving larger importance to later observations when inferring channel parameters; while it can still leverage learnings from earlier time periods, it better represents recent trends to help align with planning. The other is <a href="https://arxiv.org/pdf/2106.03322.pdf">time-varying coefficients</a>, which capture the parameters at different points in time and seem to me powerful enough to become standard practice for MMM.</p>
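<p>Recency weighting can be as simple as an exponential discount on observation age; here’s a sketch (the 26-period half-life is an illustrative assumption):</p>

```python
import numpy as np

def recency_weights(n_obs, half_life=26):
    # Newest observation gets weight 1.0; one half_life older gets 0.5, etc.
    age = np.arange(n_obs)[::-1]  # rows ordered oldest -> newest, newest age 0
    return 0.5 ** (age / half_life)

weights = recency_weights(104, half_life=26)
# e.g. pass as sample_weight to a weighted regression fit
```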
<h2 id="conclusion-aka-my-bad-opinions">Conclusion (aka My Bad Opinions)</h2>
<p>MMM can still be enormously useful, but it requires thoughtful planning and review from domain experts and data scientists before digging into the actual modeling portion. If you can’t provide that level of support for data and assumptions validation, it may make more sense to pursue a third-party marketing firm and leverage their skills instead.</p>
<p>The good news is that big tech seems to be realizing the adpocalypse is continuing and are beginning to invest in MMM tools. Here are some that I think are valuable:</p>
<ul>
<li><a href="https://github.com/google/lightweight_mmm">Google’s Lightweight MMM</a>
<ul>
<li>This is a pretty feature complete python package for building Bayesian MMMs</li>
<li>It has convenient features such as geo-level modeling, preprocessing and budget optimization</li>
</ul>
</li>
<li><a href="https://facebookexperimental.github.io/Robyn/">Facebook’s Robyn</a>
<ul>
<li>Very powerful and feature rich MMM building tool in R.</li>
<li>In my opinion, this seems like the best option for developing your own MMM if you have the resources to dive deep into it and aren’t fully committed to only python</li>
</ul>
</li>
<li><a href="https://github.com/uber/orbit">Uber’s Orbit</a>
<ul>
<li>This is a python library that implements Bayesian time-varying coefficients for time series forecasting</li>
<li>If you want less functionality and are prioritizing time varying coefficients, this is the library for you</li>
<li>It’s not an MMM-specific library, so you are missing a lot of the preprocessing and diagnostic tools found in Robyn, but if you are willing to implement those yourself (they aren’t rocket science), then this is a great option</li>
</ul>
</li>
</ul>
<p>I mostly put this together to capture areas of discussion around MMM that I felt weren’t being addressed properly, and if you have any tips on addressing these problems, please feel free to share them here or elsewhere on the internet. I’m sure lots of folks are working on the same types of problems, and increasing the literature around the practical development of MMMs would help the industry a ton.</p>Matthew Burkematthew.wesley.burke@gmail.comMMMs are a future-proofed tool for measuring marketing effectiveness in a world of increased online privacy, but can be prone to multiple silent failure methods that provide inaccurate results.Spaghetti Code2022-02-23T00:00:00+00:002022-02-23T00:00:00+00:00https://mwburke.github.io/generative%20art/2022/02/23/making-spaghetti-code<p>As I mentioned in an <a href="/generative%20art/2019/06/18/basic-tiling.html">earlier post</a> about tiling patterns, I really like truchet tilings, where each tile traditionally has a pattern that connects the midpoints of adjacent sides. I explored this concept a long time ago, but instead of using a traditional square grid, I used a hexagonal one.</p>
<blockquote class="instagram-media" data-instgrm-permalink="https://www.instagram.com/p/CAHOcTvAwKK/?utm_source=ig_embed&utm_campaign=loading" data-instgrm-version="14" style=" background:#FFF; border:0; border-radius:3px; box-shadow:0 0 1px 0 rgba(0,0,0,0.5),0 1px 10px 0 rgba(0,0,0,0.15); margin: 1px; max-width:540px; min-width:326px; padding:0; width:99.375%; width:-webkit-calc(100% - 2px); width:calc(100% - 2px);"><div style="padding:16px;"> <a href="https://www.instagram.com/p/CAHOcTvAwKK/?utm_source=ig_embed&utm_campaign=loading" style=" background:#FFFFFF; line-height:0; padding:0 0; text-align:center; text-decoration:none; width:100%;" target="_blank"> <div style=" display: flex; flex-direction: row; align-items: center;"> <div style="background-color: #F4F4F4; border-radius: 50%; flex-grow: 0; height: 40px; margin-right: 14px; width: 40px;"></div> <div style="display: flex; flex-direction: column; flex-grow: 1; justify-content: center;"> <div style=" background-color: #F4F4F4; border-radius: 4px; flex-grow: 0; height: 14px; margin-bottom: 6px; width: 100px;"></div> <div style=" background-color: #F4F4F4; border-radius: 4px; flex-grow: 0; height: 14px; width: 60px;"></div></div></div><div style="padding: 19% 0;"></div> <div style="display:block; height:50px; margin:0 auto 12px; width:50px;"><svg width="50px" height="50px" viewBox="0 0 60 60" version="1.1" xmlns="https://www.w3.org/2000/svg" xmlns:xlink="https://www.w3.org/1999/xlink"><g stroke="none" stroke-width="1" fill="none" fill-rule="evenodd"><g transform="translate(-511.000000, -20.000000)" fill="#000000"><g><path d="M556.869,30.41 C554.814,30.41 553.148,32.076 553.148,34.131 C553.148,36.186 554.814,37.852 556.869,37.852 C558.924,37.852 560.59,36.186 560.59,34.131 C560.59,32.076 558.924,30.41 556.869,30.41 M541,60.657 C535.114,60.657 530.342,55.887 530.342,50 C530.342,44.114 535.114,39.342 541,39.342 C546.887,39.342 551.658,44.114 551.658,50 C551.658,55.887 546.887,60.657 541,60.657 M541,33.886 
C532.1,33.886 524.886,41.1 524.886,50 C524.886,58.899 532.1,66.113 541,66.113 C549.9,66.113 557.115,58.899 557.115,50 C557.115,41.1 549.9,33.886 541,33.886 M565.378,62.101 C565.244,65.022 564.756,66.606 564.346,67.663 C563.803,69.06 563.154,70.057 562.106,71.106 C561.058,72.155 560.06,72.803 558.662,73.347 C557.607,73.757 556.021,74.244 553.102,74.378 C549.944,74.521 548.997,74.552 541,74.552 C533.003,74.552 532.056,74.521 528.898,74.378 C525.979,74.244 524.393,73.757 523.338,73.347 C521.94,72.803 520.942,72.155 519.894,71.106 C518.846,70.057 518.197,69.06 517.654,67.663 C517.244,66.606 516.755,65.022 516.623,62.101 C516.479,58.943 516.448,57.996 516.448,50 C516.448,42.003 516.479,41.056 516.623,37.899 C516.755,34.978 517.244,33.391 517.654,32.338 C518.197,30.938 518.846,29.942 519.894,28.894 C520.942,27.846 521.94,27.196 523.338,26.654 C524.393,26.244 525.979,25.756 528.898,25.623 C532.057,25.479 533.004,25.448 541,25.448 C548.997,25.448 549.943,25.479 553.102,25.623 C556.021,25.756 557.607,26.244 558.662,26.654 C560.06,27.196 561.058,27.846 562.106,28.894 C563.154,29.942 563.803,30.938 564.346,32.338 C564.756,33.391 565.244,34.978 565.378,37.899 C565.522,41.056 565.552,42.003 565.552,50 C565.552,57.996 565.522,58.943 565.378,62.101 M570.82,37.631 C570.674,34.438 570.167,32.258 569.425,30.349 C568.659,28.377 567.633,26.702 565.965,25.035 C564.297,23.368 562.623,22.342 560.652,21.575 C558.743,20.834 556.562,20.326 553.369,20.18 C550.169,20.033 549.148,20 541,20 C532.853,20 531.831,20.033 528.631,20.18 C525.438,20.326 523.257,20.834 521.349,21.575 C519.376,22.342 517.703,23.368 516.035,25.035 C514.368,26.702 513.342,28.377 512.574,30.349 C511.834,32.258 511.326,34.438 511.181,37.631 C511.035,40.831 511,41.851 511,50 C511,58.147 511.035,59.17 511.181,62.369 C511.326,65.562 511.834,67.743 512.574,69.651 C513.342,71.625 514.368,73.296 516.035,74.965 C517.703,76.634 519.376,77.658 521.349,78.425 C523.257,79.167 525.438,79.673 528.631,79.82 C531.831,79.965 532.853,80.001 
541,80.001 C549.148,80.001 550.169,79.965 553.369,79.82 C556.562,79.673 558.743,79.167 560.652,78.425 C562.623,77.658 564.297,76.634 565.965,74.965 C567.633,73.296 568.659,71.625 569.425,69.651 C570.167,67.743 570.674,65.562 570.82,62.369 C570.966,59.17 571,58.147 571,50 C571,41.851 570.966,40.831 570.82,37.631"></path></g></g></g></svg></div><div style="padding-top: 8px;"> <div style=" color:#3897f0; font-family:Arial,sans-serif; font-size:14px; font-style:normal; font-weight:550; line-height:18px;">View this post on Instagram</div></div><div style="padding: 12.5% 0;"></div> <div style="display: flex; flex-direction: row; margin-bottom: 14px; align-items: center;"><div> <div style="background-color: #F4F4F4; border-radius: 50%; height: 12.5px; width: 12.5px; transform: translateX(0px) translateY(7px);"></div> <div style="background-color: #F4F4F4; height: 12.5px; transform: rotate(-45deg) translateX(3px) translateY(1px); width: 12.5px; flex-grow: 0; margin-right: 14px; margin-left: 2px;"></div> <div style="background-color: #F4F4F4; border-radius: 50%; height: 12.5px; width: 12.5px; transform: translateX(9px) translateY(-18px);"></div></div><div style="margin-left: 8px;"> <div style=" background-color: #F4F4F4; border-radius: 50%; flex-grow: 0; height: 20px; width: 20px;"></div> <div style=" width: 0; height: 0; border-top: 2px solid transparent; border-left: 6px solid #f4f4f4; border-bottom: 2px solid transparent; transform: translateX(16px) translateY(-4px) rotate(30deg)"></div></div><div style="margin-left: auto;"> <div style=" width: 0px; border-top: 8px solid #F4F4F4; border-right: 8px solid transparent; transform: translateY(16px);"></div> <div style=" background-color: #F4F4F4; flex-grow: 0; height: 12px; width: 16px; transform: translateY(-4px);"></div> <div style=" width: 0; height: 0; border-top: 8px solid #F4F4F4; border-left: 8px solid transparent; transform: translateY(-4px) translateX(8px);"></div></div></div> <div style="display: flex; 
flex-direction: column; flex-grow: 1; justify-content: center; margin-bottom: 24px;"> <div style=" background-color: #F4F4F4; border-radius: 4px; flex-grow: 0; height: 14px; margin-bottom: 6px; width: 224px;"></div> <div style=" background-color: #F4F4F4; border-radius: 4px; flex-grow: 0; height: 14px; width: 144px;"></div></div></a><p style=" color:#c9c8cd; font-family:Arial,sans-serif; font-size:14px; line-height:17px; margin-bottom:0; margin-top:8px; overflow:hidden; padding:8px 0 7px; text-align:center; text-overflow:ellipsis; white-space:nowrap;"><a href="https://www.instagram.com/p/CAHOcTvAwKK/?utm_source=ig_embed&utm_campaign=loading" style=" color:#c9c8cd; font-family:Arial,sans-serif; font-size:14px; font-style:normal; font-weight:normal; line-height:17px; text-decoration:none;" target="_blank">A post shared by Matthew Burke (@yot_club_)</a></p></div></blockquote>
<script async="" src="//www.instagram.com/embed.js"></script>
<p>Each hexagon has every side connected to another side, and by repeating these connections, you end up with some emergent structures of how the paths travel over time. I like calling them noodles, so that’s what I’ll do here from now on.</p>
<p>My original implementation was… a bit of a disaster from the code perspective. I left it for a while, and when I went back to revisit it to color each of these noodles individually, I found myself confused by my own work and gave up. Later on, I still felt the urge to improve upon my initial version, and recreating it from the ground up seemed like the fastest option.</p>
<h1 id="new-implementation">New Implementation</h1>
<h2 id="paths-generation">Paths Generation</h2>
<p>I built upon my initial idea of defining a grid of flat-top hexagons, and added some definitions to keep things clear. Each side of the hexagon is assigned an index integer, as well as a list of integer pairs to define which sides of the hexagon are connected. Each hexagon could contain zero to three sets of connections, with their order denoting the order in which they would be drawn to the screen.</p>
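<p>As a rough sketch of how those connection sets might be generated (the pairing scheme and function name here are illustrative assumptions, not the project’s actual code):</p>

```python
import random

def random_connections(rng, n_sets=3):
    """Pick up to n_sets disjoint pairs of the six side indices (0-5).

    Each pair is one path segment through the hexagon, and the list
    order is the order the segments are drawn to the screen."""
    sides = list(range(6))
    rng.shuffle(sides)  # a random partition of the sides into pairs
    return [tuple(sorted(sides[2 * i:2 * i + 2])) for i in range(n_sets)]
```

<p>Passing <code>n_sets</code> of 0 through 3 covers the “zero to three sets of connections” per hexagon mentioned above.</p>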
<p><img src="/images/spaghetti_3.png" alt="" /></p>
<p>Once we generate connections for each of the hexagons, we iterate over all the hexagons and over all the connections for each hexagon. We start at the first connection, and if it has not been visited, recursively follow the path from one hexagon to the next, using the side indices we defined earlier to choose the next hexagon, until we reach an end, a visited path, or the starting point. As we traverse them, we mark each of the connections as visited, so that when we reach that connection later on, we will not pursue it. One note is that, since we don’t know whether or not we are starting at an endpoint, we traverse the path in two different directions, and just swap the connection orders for one of the directions before combining.</p>
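<p>A minimal sketch of that traversal, assuming axial hexagon coordinates and the convention that crossing side <code>s</code> lands on side <code>(s + 3) % 6</code> of the neighbor; the direction table and the single-direction walk below are illustrative simplifications of the two-direction traversal described above:</p>

```python
# Axial-coordinate offsets to the neighbor across each side of a
# flat-top hexagon; the index-to-direction mapping is an assumption.
NEIGHBOR_OFFSETS = {0: (1, 0), 1: (1, -1), 2: (0, -1),
                    3: (-1, 0), 4: (-1, 1), 5: (0, 1)}

def follow_path(grid, start_hex, start_side):
    """Walk one direction along a path of (hexagon, connection) steps
    until it leaves the grid or hits an already-visited connection.

    grid maps axial (q, r) coordinates to lists of side-index pairs."""
    path, visited = [], set()
    hex_, side = start_hex, start_side
    while True:
        conns = grid.get(hex_, [])
        conn = next((c for c in conns if side in c), None)
        if conn is None or (hex_, conn) in visited:
            return path
        visited.add((hex_, conn))
        path.append((hex_, conn))
        # Leave through the connection's other side...
        exit_side = conn[1] if conn[0] == side else conn[0]
        # ...and enter the neighboring hexagon through the side that
        # touches the one we just exited.
        dq, dr = NEIGHBOR_OFFSETS[exit_side]
        hex_ = (hex_[0] + dq, hex_[1] + dr)
        side = (exit_side + 3) % 6
```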
<p>The result of following a single path across hexagons is shown below:</p>
<p><img src="/images/spaghetti_0.png" alt="" /></p>
<p>We do this for all hexagons and all connections, collecting them into path objects with their hexagon locations and x/y coordinates. Once this is finished, we have all our paths and just need to cap them with semicircles of the correct end colors. Here is a final render of all paths being drawn without caps, overlaid with the hexagon grid that underlies them all.</p>
<p><img src="/images/spaghetti_2.png" alt="" /></p>
<p>At this point, we have all we need and just need to assign the colors and work on generating interesting patterns.</p>
<h2 id="style-generation">Style Generation</h2>
<p>While there were dozens of ideas I could have explored, I get overwhelmed by too many options, so I limited myself to a few areas to delve deeper into and came up with the following final list of constrained parameters to work within.</p>
<ul>
<li>Colors
<ul>
<li>Overall palette
<ul>
<li>What do we want to use for the background and noodle colors?</li>
<li>Even if a palette is good, does it have enough contrast to distinguish between foreground/background?</li>
<li>Does the palette have muddy colors if we have a gradient or do the in between colors look good as well?</li>
</ul>
</li>
<li>Noodle style
<ul>
<li>Color it directly or add an outline surrounding it
<ul>
<li>If we have an outline, do we keep a static one across all noodles or pick a different outline color for each?</li>
</ul>
</li>
<li>Color the noodle with a single color or make a color-changing gradient
<ul>
<li>If we have a gradient, do we rotate through the palette in order or assign sequential colors randomly?</li>
</ul>
</li>
</ul>
</li>
</ul>
</li>
<li>Path shapes
<ul>
<li>Connection patterns
<ul>
<li>Fully random for each hexagon
<ul>
<li>Do we keep three sets of connections for each one or do we have a lower amount?</li>
</ul>
</li>
<li>Stripes along horizontal, vertical and diagonals</li>
<li>Chunks of MxN hexagon groups that are repeated along horizontal, vertical and diagonals with or without offsets</li>
</ul>
</li>
<li>Border
<ul>
<li>Do we feed connections back to the middle or just truncate?</li>
</ul>
</li>
<li>Curve polygon fidelity
<ul>
<li>Closer to fully round</li>
<li>Low poly</li>
</ul>
</li>
</ul>
</li>
</ul>
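<p>The two gradient modes above can be sketched as a small helper (illustrative only; the actual piece presumably does this in its rendering code rather than as a standalone Python function):</p>

```python
def noodle_colors(n_segments, palette, rng=None):
    """Assign a color per noodle segment: cycle the palette in order,
    or pick each successive color at random when an rng is given (the
    two gradient options discussed above)."""
    if rng is None:
        return [palette[i % len(palette)] for i in range(n_segments)]
    return [rng.choice(palette) for _ in range(n_segments)]
```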
<p>Ultimately, I think my favorites were gradient noodles without outlines that either had random connections or repeating chunk patterns.</p>
<blockquote class="instagram-media" data-instgrm-permalink="https://www.instagram.com/p/CZqAMbSvnzQ/?utm_source=ig_embed&utm_campaign=loading" data-instgrm-version="14" style=" background:#FFF; border:0; border-radius:3px; box-shadow:0 0 1px 0 rgba(0,0,0,0.5),0 1px 10px 0 rgba(0,0,0,0.15); margin: 1px; max-width:540px; min-width:326px; padding:0; width:99.375%; width:-webkit-calc(100% - 2px); width:calc(100% - 2px);"><div style="padding:16px;"> <a href="https://www.instagram.com/p/CZqAMbSvnzQ/?utm_source=ig_embed&utm_campaign=loading" style=" background:#FFFFFF; line-height:0; padding:0 0; text-align:center; text-decoration:none; width:100%;" target="_blank"> <div style=" display: flex; flex-direction: row; align-items: center;"> <div style="background-color: #F4F4F4; border-radius: 50%; flex-grow: 0; height: 40px; margin-right: 14px; width: 40px;"></div> <div style="display: flex; flex-direction: column; flex-grow: 1; justify-content: center;"> <div style=" background-color: #F4F4F4; border-radius: 4px; flex-grow: 0; height: 14px; margin-bottom: 6px; width: 100px;"></div> <div style=" background-color: #F4F4F4; border-radius: 4px; flex-grow: 0; height: 14px; width: 60px;"></div></div></div><div style="padding: 19% 0;"></div> <div style="display:block; height:50px; margin:0 auto 12px; width:50px;"><svg width="50px" height="50px" viewBox="0 0 60 60" version="1.1" xmlns="https://www.w3.org/2000/svg" xmlns:xlink="https://www.w3.org/1999/xlink"><g stroke="none" stroke-width="1" fill="none" fill-rule="evenodd"><g transform="translate(-511.000000, -20.000000)" fill="#000000"><g><path d="M556.869,30.41 C554.814,30.41 553.148,32.076 553.148,34.131 C553.148,36.186 554.814,37.852 556.869,37.852 C558.924,37.852 560.59,36.186 560.59,34.131 C560.59,32.076 558.924,30.41 556.869,30.41 M541,60.657 C535.114,60.657 530.342,55.887 530.342,50 C530.342,44.114 535.114,39.342 541,39.342 C546.887,39.342 551.658,44.114 551.658,50 C551.658,55.887 546.887,60.657 541,60.657 M541,33.886 
C532.1,33.886 524.886,41.1 524.886,50 C524.886,58.899 532.1,66.113 541,66.113 C549.9,66.113 557.115,58.899 557.115,50 C557.115,41.1 549.9,33.886 541,33.886 M565.378,62.101 C565.244,65.022 564.756,66.606 564.346,67.663 C563.803,69.06 563.154,70.057 562.106,71.106 C561.058,72.155 560.06,72.803 558.662,73.347 C557.607,73.757 556.021,74.244 553.102,74.378 C549.944,74.521 548.997,74.552 541,74.552 C533.003,74.552 532.056,74.521 528.898,74.378 C525.979,74.244 524.393,73.757 523.338,73.347 C521.94,72.803 520.942,72.155 519.894,71.106 C518.846,70.057 518.197,69.06 517.654,67.663 C517.244,66.606 516.755,65.022 516.623,62.101 C516.479,58.943 516.448,57.996 516.448,50 C516.448,42.003 516.479,41.056 516.623,37.899 C516.755,34.978 517.244,33.391 517.654,32.338 C518.197,30.938 518.846,29.942 519.894,28.894 C520.942,27.846 521.94,27.196 523.338,26.654 C524.393,26.244 525.979,25.756 528.898,25.623 C532.057,25.479 533.004,25.448 541,25.448 C548.997,25.448 549.943,25.479 553.102,25.623 C556.021,25.756 557.607,26.244 558.662,26.654 C560.06,27.196 561.058,27.846 562.106,28.894 C563.154,29.942 563.803,30.938 564.346,32.338 C564.756,33.391 565.244,34.978 565.378,37.899 C565.522,41.056 565.552,42.003 565.552,50 C565.552,57.996 565.522,58.943 565.378,62.101 M570.82,37.631 C570.674,34.438 570.167,32.258 569.425,30.349 C568.659,28.377 567.633,26.702 565.965,25.035 C564.297,23.368 562.623,22.342 560.652,21.575 C558.743,20.834 556.562,20.326 553.369,20.18 C550.169,20.033 549.148,20 541,20 C532.853,20 531.831,20.033 528.631,20.18 C525.438,20.326 523.257,20.834 521.349,21.575 C519.376,22.342 517.703,23.368 516.035,25.035 C514.368,26.702 513.342,28.377 512.574,30.349 C511.834,32.258 511.326,34.438 511.181,37.631 C511.035,40.831 511,41.851 511,50 C511,58.147 511.035,59.17 511.181,62.369 C511.326,65.562 511.834,67.743 512.574,69.651 C513.342,71.625 514.368,73.296 516.035,74.965 C517.703,76.634 519.376,77.658 521.349,78.425 C523.257,79.167 525.438,79.673 528.631,79.82 C531.831,79.965 532.853,80.001 
541,80.001 C549.148,80.001 550.169,79.965 553.369,79.82 C556.562,79.673 558.743,79.167 560.652,78.425 C562.623,77.658 564.297,76.634 565.965,74.965 C567.633,73.296 568.659,71.625 569.425,69.651 C570.167,67.743 570.674,65.562 570.82,62.369 C570.966,59.17 571,58.147 571,50 C571,41.851 570.966,40.831 570.82,37.631"></path></g></g></g></svg></div><div style="padding-top: 8px;"> <div style=" color:#3897f0; font-family:Arial,sans-serif; font-size:14px; font-style:normal; font-weight:550; line-height:18px;">View this post on Instagram</div></div><div style="padding: 12.5% 0;"></div> <div style="display: flex; flex-direction: row; margin-bottom: 14px; align-items: center;"><div> <div style="background-color: #F4F4F4; border-radius: 50%; height: 12.5px; width: 12.5px; transform: translateX(0px) translateY(7px);"></div> <div style="background-color: #F4F4F4; height: 12.5px; transform: rotate(-45deg) translateX(3px) translateY(1px); width: 12.5px; flex-grow: 0; margin-right: 14px; margin-left: 2px;"></div> <div style="background-color: #F4F4F4; border-radius: 50%; height: 12.5px; width: 12.5px; transform: translateX(9px) translateY(-18px);"></div></div><div style="margin-left: 8px;"> <div style=" background-color: #F4F4F4; border-radius: 50%; flex-grow: 0; height: 20px; width: 20px;"></div> <div style=" width: 0; height: 0; border-top: 2px solid transparent; border-left: 6px solid #f4f4f4; border-bottom: 2px solid transparent; transform: translateX(16px) translateY(-4px) rotate(30deg)"></div></div><div style="margin-left: auto;"> <div style=" width: 0px; border-top: 8px solid #F4F4F4; border-right: 8px solid transparent; transform: translateY(16px);"></div> <div style=" background-color: #F4F4F4; flex-grow: 0; height: 12px; width: 16px; transform: translateY(-4px);"></div> <div style=" width: 0; height: 0; border-top: 8px solid #F4F4F4; border-left: 8px solid transparent; transform: translateY(-4px) translateX(8px);"></div></div></div> <div style="display: flex; 
flex-direction: column; flex-grow: 1; justify-content: center; margin-bottom: 24px;"> <div style=" background-color: #F4F4F4; border-radius: 4px; flex-grow: 0; height: 14px; margin-bottom: 6px; width: 224px;"></div> <div style=" background-color: #F4F4F4; border-radius: 4px; flex-grow: 0; height: 14px; width: 144px;"></div></div></a><p style=" color:#c9c8cd; font-family:Arial,sans-serif; font-size:14px; line-height:17px; margin-bottom:0; margin-top:8px; overflow:hidden; padding:8px 0 7px; text-align:center; text-overflow:ellipsis; white-space:nowrap;"><a href="https://www.instagram.com/p/CZqAMbSvnzQ/?utm_source=ig_embed&utm_campaign=loading" style=" color:#c9c8cd; font-family:Arial,sans-serif; font-size:14px; font-style:normal; font-weight:normal; line-height:17px; text-decoration:none;" target="_blank">A post shared by Matthew Burke (@yot_club_)</a></p></div></blockquote>
<script async="" src="//www.instagram.com/embed.js"></script>
<h2 id="resources">Resources</h2>
<ul>
<li><a href="https://observablehq.com/@osteele/truchet-tile-generation">Generalized Truchet Tiles</a></li>
<li><a href="https://www.redblobgames.com/grids/hexagons/">Hexagonal Grids - Red Blob Games</a></li>
</ul>Matthew Burkematthew.wesley.burke@gmail.comMaking generative noodle artCodenames Clue Generator using Semantic Similarity2021-12-12T00:00:00+00:002021-12-12T00:00:00+00:00https://mwburke.github.io/data%20science/2021/12/12/codenames-clue-generator-version-1<p>In this post, I’ll talk about how I built a clue generator for the game Codenames that provides a list of potential clues, numbers and associated target words, all with Tensorflow.</p>
<p>My day job is mostly internally facing and so I took this on as a way to practice building product-focused data science projects.</p>
<h1 id="how-codenames-works">How Codenames Works</h1>
<p>If you already know how the game works, feel free to skip or read again for a quick reminder.</p>
<p>Codenames is a card game with 2 teams. There are 25 cards laid out on the board, 9 belonging to one team, 8 belonging to another, 7 neutral and 1 double agent card.</p>
<p>Each team has a <strong>codemaster</strong> who can see which cards belong to which teams, and the remaining members of the teams are <strong>spies</strong> who only see a single word on each card.</p>
<p>The teams take turns having the codemaster provide a clue to their team made up of a single word and a number, where the number indicates how many cards on the board the clue relates to. The goal is to get the team to guess which words the clue is indicating, and they select cards to turn over.</p>
<p>If they select a card belonging to their team, they can continue guessing, but if they flip over a card that doesn’t, their turn is immediately ended and they could suffer the negative consequences of potentially flipping over the other team’s card, bringing them closer to their goal, or flipping over the double agent card and instantly losing the game.</p>
<p>Thus, the codemaster seeks to find clues that maximize the relationship to words on their team and minimize the relationship to words on the other team. Additionally, by finding clues with a larger number of cards it relates to, they can increase their chance of beating the other team by finishing first, but they risk having a lower relevance to each of the target cards and higher chance of accidentally missing a connection for opposing cards.</p>
<h1 id="how-i-built-it">How I Built It</h1>
<h2 id="project-goals">Project Goals</h2>
<p>We represent a current board and team state with the following inputs:</p>
<ul>
<li>Current team’s cards</li>
<li>Opposing team’s cards</li>
<li>Neutral cards</li>
<li>Double agent card</li>
</ul>
<p>What we are looking for is a list of potential clues the codemaster could use with the following fields:</p>
<ol>
<li>Clue word</li>
<li>Clue number</li>
<li>Target words the clue is intended to relate to</li>
<li>Quantitative measure of the quality of the clue</li>
</ol>
<h2 id="quantifying-clue-quality">Quantifying Clue Quality</h2>
<p>As with most data science problems, the hardest part is quantifying exactly what you are looking to maximize or predict. In this case, we have a vague notion of maximizing and minimizing the relevance of our clue word to words on the board. While there are many ways to do this, the way I chose to frame it for now is in terms of embeddings.</p>
<h3 id="word-embeddings">Word Embeddings</h3>
<p><a href="https://en.wikipedia.org/wiki/Word_embedding">Word embeddings</a> are a way to represent words quantitatively with a list of numbers, which we will refer to here as a vector. The main idea is that words with similar meanings will have similar number representations, and that related words will have a similar relationship. For example, woman -> man should have a similar relationship as queen -> king. Or Pooh -> Tigger should have a similar relationship as bear -> tiger (ok maybe this one’s a bit of a stretch, but you get the picture).</p>
<p>Rather than generating my own, I used a pre-trained model from Tensorflow, the <a href="https://tfhub.dev/google/Wiki-words-500/2">Wiki-words-500</a> text embedding that already provides a mapping from words to their vector representations. I now have a function to translate any given English word into a vector of length 500.</p>
<p>Please see the end for discussions about future improvements related to choosing an embedding corpus.</p>
<h3 id="word-relevance---cosine-similarity">Word Relevance - Cosine Similarity</h3>
<p>Having numerical representations of words is a start, but what we really care about is the relationships between words. We need to compare the vectors to begin to use them.</p>
<p>When comparing vectors, you will often hear the language of <strong>distance</strong> and <strong>similarity</strong>, which are two sides of the same coin, meaning difference and closeness of two vectors, respectively. For certain types of distances, we may just subtract the value from one to switch between the two.</p>
<p>For this case, I chose to work with <a href="https://en.wikipedia.org/wiki/Cosine_similarity">cosine similarity</a>, although I may look into other options in the future. This gives us a single number ranging from -1 to 1, with -1 indicating two words’ being as dissimilar as possible and 1 being equivalent.</p>
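<p>For two embedding vectors, cosine similarity is just the dot product divided by the vector lengths; a minimal NumPy version (an illustrative helper, not the post’s actual Tensorflow code):</p>

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1 when they point the
    same way, 0 when orthogonal, -1 when opposite."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
```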
<h3 id="clue-quality-metric">Clue Quality Metric</h3>
<p>In order to summarize clue quality in a single number, we consider the benefits and penalties associated with the outcome of guessing a card on the table. Obviously, we want to incentivize choosing clues that are relevant to our team and disincentivize other cards, with increasing penalties for the undesirable outcomes: a neutral card ends our turn, the opposing team’s card ends our turn and advances them toward their goal, and the double agent loses us the game.</p>
<p>The way we summarize this is by multiplying the cosine similarity for each card on the table by a set of coefficients that represent these benefits/penalties. The process is as follows:</p>
<ol>
<li>Extract word bank embeddings and cache since they will be reused for all games</li>
<li>Get current game word embeddings</li>
<li>Calculate cosine similarity between all game words and all word bank words</li>
<li>Multiply similarity scores by appropriate card type coefficients</li>
<li>Sum up all final scores for each word bank word to get clue quality metric</li>
</ol>
<p>This can all be accomplished very quickly with Tensorflow using their pre-trained embeddings and a series of matrix multiplications.</p>
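<p>A NumPy sketch of steps 3–5 above (shapes, names, and coefficient values here are illustrative assumptions; the post’s implementation uses Tensorflow):</p>

```python
import numpy as np

def clue_quality(bank_emb, game_emb, coefs):
    """bank_emb: (n_bank, d) word-bank embeddings.
    game_emb: (n_game, d) embeddings of the words on the board.
    coefs: (n_game,) per-card coefficients, positive for our team's
    cards and negative for the rest.
    Returns one quality score per word-bank word."""
    # Row-normalize so a single matrix product yields all pairwise
    # cosine similarities at once.
    bank = bank_emb / np.linalg.norm(bank_emb, axis=1, keepdims=True)
    game = game_emb / np.linalg.norm(game_emb, axis=1, keepdims=True)
    sims = bank @ game.T                    # (n_bank, n_game)
    # Weight each similarity by its card-type coefficient and sum.
    return (sims * coefs).sum(axis=1)
```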
<h4 id="clue-size">Clue Size</h4>
<p>We do have an additional constraint to limit the number of words that the clue relates to, which changes how we think about the quality metric. The overall structure remains the same, but we need some way to determine which of our team’s cards to include in the clue.</p>
<p>The way I implemented it was to set a similarity threshold and only keep clues that have a similarity value equal to or greater than the threshold. This is the most straightforward way, and it ensures a global level of relevance. Of course, this introduces another parameter to tweak that we don’t have an exact way to measure the effectiveness of, and we do run the risk of excluding relevant clues that fall right below the cutoff. However, as problems go, having your team select another one of their cards is a decent one to have, although it may cause confusion later down the line.</p>
<p>The process for calculating the quality metric remains the same as above, except that we first remove all cards below the similarity threshold and then calculate the contribution of the remaining ones toward our metric.</p>
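<p>A small sketch of that filtering step for picking a clue’s target words and number (the 0.3 default threshold is an arbitrary illustrative value, not the post’s tuned setting):</p>

```python
import numpy as np

def clue_targets(team_sims, team_words, threshold=0.3):
    """Keep only the team's cards whose similarity to the clue meets
    the threshold; the clue number is simply how many survive."""
    keep = np.asarray(team_sims) >= threshold
    targets = [w for w, k in zip(team_words, keep) if k]
    return targets, len(targets)
```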
<h4 id="setting-coefficients">Setting Coefficients</h4>
<p>It’s clear that we want a positive coefficient for our cards and monotonically decreasing negative coefficients for opposing, neutral and double agent cards respectively, but it’s not obvious exactly what they should be for several reasons:</p>
<ol>
<li>All of the coefficients are relative to one another so there isn’t a single global optimum</li>
<li>We are codifying the codemaster’s risk preferences to a single set of numbers
<ul>
<li>Some people may have a higher risk tolerance for clues similar to the double agent card, or they may never want to even have a small chance of guessing it</li>
</ul>
</li>
<li>The number of cards in each category changes over the course of the game
<ul>
<li>We may need to scale the contributions of remaining team/opposing cards. If both teams are guessing accurately, there will be few cards belonging to them and a higher concentration of neutral cards.</li>
<li>This may dilute the quality metric by having it be mostly composed of negative scores. The clues will mostly be avoiding the other cards rather than leaning towards the remaining cards</li>
<li>It remains to be seen if this is problematic, or if at that point the codemaster no longer needs to rely on a clue generator, since the problem space is much smaller</li>
</ul>
</li>
<li>We don’t have a clear metric on how to evaluate the effectiveness of the metric as of now</li>
</ol>
<h2 id="solution-validation--testing-plans">Solution Validation / Testing Plans</h2>
<p>Number 4 above is the elephant in the room: <strong>How do we know our solution is effective?</strong></p>
<p>The ideal method would be to test a bunch of games with randomly assigned teams, and provide the test teams with access to the clue recommendations. Our expectation is that the win rates would be equal between groups, and any significant difference would be driven by access to the tool. I would rather test giving tool access, but not mandating usage, because that’s a more realistic scenario in practice than forcing them to use the top recommendations every time. At this point, I don’t think we would consistently beat player intuition, so it’s not a valid comparison. However, the time required to get volunteers and acquire data seems impractical, so are there any other ways we can perform testing?</p>
<h3 id="backtesting">Backtesting</h3>
<p>If you run a Codenames online site with textual clue inputs, you could backtest and see how many times the clues given by users would have been recommended by the tool. There are multiple metrics used in recommender systems you could use to evaluate performance, including <a href="https://en.wikipedia.org/wiki/Discounted_cumulative_gain">NDCG</a> or an adapted version of <a href="http://sdsawtelle.github.io/blog/output/mean-average-precision-MAP-for-recommender-systems.html">Mean Average Precision</a>.</p>
<p>Regardless of what method you use, there are several problems:</p>
<ol>
<li>Sometimes people give bad clues. I’ve done it, others do it. It’s fairly common. How will this affect our scores?
<ul>
<li>We could potentially do some censoring to only include clues where the codemaster’s team guessed all of the associated words correctly, if we had access to that data. We could determine whether or not they guessed the correct number of cards, but as far as I’ve seen, online sites don’t seem to have tagging for relevant words to clues. At the very least, it would be a more fair comparison, even if there’s still a known source of error.</li>
</ul>
</li>
<li>The recommender word bank may include many words not in the common vernacular that are still relevant. Should they be penalized just because they’re niche?</li>
<li>We don’t have any proper nouns in our word bank. These can be very effective: think Potter for ceramic and magic as an example.</li>
</ol>
<h3 id="mechanical-turk">Mechanical Turk</h3>
<p>A common way to generate datasets for bespoke targets is through <a href="https://www.mturk.com/">Amazon Mechanical Turk</a> , where you can get people to complete arbitrary tasks online for money. This often is used in ML to generate labels for unsupervised data such as images or natural language.</p>
<p>In this case, proper evaluation takes a fair amount of background understanding of the game just to be able to make evaluations, and for accurate evaluations, experience actually playing. Given that the cost of getting random people to take the time to learn a new game, confirm that their understanding is accurate, and then actually play test games would be exorbitant, we need to break our method into easier-to-consume subtasks that are proxies for clue quality.</p>
<p>I propose that we could potentially focus on getting people to evaluate clue similarity or dissimilarity to a set of words. This could be done either as choosing the most/least relevant clue to a set of words from a list of potential clues, or providing a clue and bank of words, and having them choose the most/least relevant words to the clue. This removes the need to evaluate multiple objectives simultaneously, and increases the amount of data we could collect per dollar. Evaluation would be between existing versions of the clue generator, or between existing game samples and the clue generator.</p>
<p>Again, this suffers from not actually evaluating performance on the game metrics, but, once we have an existing solution we deem is working well, we could use it as a way to test champion/challenge models on specific parts of the quality score (similarity to team words, dissimilarity to all other words).</p>
<h1 id="future-work">Future Work</h1>
<p>If not obvious by now, there are a lot of potential areas for improvement that I would like to pursue given time, but here are some of the main ones:</p>
<h2 id="graph-based-similarity">Graph-Based Similarity</h2>
<p>The current approach suffers from words with multiple meanings, the curse of dimensionality, a lack of concrete, objective measurements of similarity, and proper nouns in the word bank.</p>
<p>Switching to a knowledge graph, or even web-search <a href="https://en.wikipedia.org/wiki/PageRank">PageRank</a> like approach would help shore up the above problems and maybe be used in tandem with semantic similarity recommendations if not replacing it entirely.</p>
<h2 id="word-embeddings-1">Word Embeddings</h2>
<ul>
<li>Additional research into more appropriate pre-trained word embeddings</li>
<li>Generate our own embeddings by training an NLP model on a corpus we designed for this</li>
</ul>
<h2 id="quality-metric">Quality Metric</h2>
<ul>
<li>Add a relative score component for clue selection
<ul>
<li>Using an elbow method similar to identifying the appropriate number of clusters?</li>
</ul>
</li>
<li>Scaling based on number of cards still available to deal with clue dilution of team’s cards compared to other cards</li>
</ul>
<h1 id="resources">Resources</h1>
<ul>
<li><a href="https://codenames.game/">Codenames Online Game</a></li>
<li>Word Embeddings:
<ul>
<li><a href="https://machinelearningmastery.com/what-are-word-embeddings/">Machine Learning Mastery: What Are Word Embeddings</a></li>
<li>Tensorflow has a <a href="https://www.tensorflow.org/text/guide/word_embeddings">guide to working with embeddings</a> in a neural network for those who work in ML/NLP.</li>
</ul>
</li>
<li><a href="https://en.wikipedia.org/wiki/Cosine_similarity">Cosine Similarity</a></li>
</ul>Matthew Burkematthew.wesley.burke@gmail.comUtilizing Tensorflow pre-trained embeddings to recommend potential clues to the codemasters in the card game CodenamesBasic Geometric Tiling2019-06-18T00:00:00+00:002019-06-18T00:00:00+00:00https://mwburke.github.io/generative%20art/2019/06/18/basic-tiling<h1 id="geometric-tiling">Geometric Tiling</h1>
<p>I went on vacation to Italy recently, and while I was there, I fell in love with the mosaic tilings in the Cathedral of Santa Maria del Fiore and Baptistery of St. John in Florence. In general, I’m a huge fan of geometric design, but these designs really caught my eye, and I did my best to recreate some of them in Processing with some nonstandard color palettes:</p>
<p><img src="/images/italy_mosaic_1.png" alt="" /></p>
<p><img src="/images/italy_mosaic_2.png" alt="" /></p>
<p>If these piqued your interest, I’d recommend checking out more <a href="https://mwburke.github.io/generative-art/posts/030.html">at my generative art site</a>, or much better, go visit Florence yourself and get inspired!</p>
<p>Of course, once I returned home and was talking about how beautiful the tiling was, I was informed about <a href="https://en.wikipedia.org/wiki/Islamic_geometric_patterns">Islamic geometric patterns</a>, which blew Italy out of the water in terms of complexity and creativity. I definitely will be reviewing my future travel plans in light of this discovery, and in the meantime, I hopefully can learn more about their theory and history to get a better appreciation of them.</p>
<h2 id="organic-tiling-truchet-patterns">Organic Tiling: Truchet Patterns</h2>
<p>While geometric patterns are always awesome, I had recently run into <a href="https://christophercarlson.com/portfolio/multi-scale-truchet-patterns/">this article</a> talking about truchet patterns, and wanted to try something a little more rounded and actually generative to see if it had a more “organic” feel about it.</p>
<p>The idea behind them is that they are square tiles with round internal paths/connections that can be connected to any other tile in the set. It didn’t take long to create each one of them, but after just generating a random tileset, the results are rather unsatisfying:</p>
<p><img src="/images/truchet_pattern_4.png" alt="" /></p>
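<p>The random-tileset step above can be sketched as pure data generation. My sketches are written in processing, but the underlying data looks roughly like this in Python; the tile names here are hypothetical labels I made up for the two classic quarter-circle tiles:</p>

```python
import random

# hypothetical names for the two classic quarter-circle truchet tiles
TILES = ("arc_nw_se", "arc_ne_sw")
ROTATIONS = (0, 90, 180, 270)

def random_truchet_grid(rows, cols, seed=None):
    """Assign every cell a tile and a rotation, uniformly at random."""
    rng = random.Random(seed)
    return [
        [(rng.choice(TILES), rng.choice(ROTATIONS)) for _ in range(cols)]
        for _ in range(rows)
    ]

grid = random_truchet_grid(8, 8, seed=1)
```

A rendering pass then just draws each tile's arcs at the stored rotation; all the structure (or lack of it) lives in how the grid is filled.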
<p>This is pretty much in line with what I have been viewing and reading from well-known generative artists, so I took a stab at introducing a little more structure into the process by nesting squares of patterns within each other, and was quite pleased:</p>
<p><img src="/images/truchet_pattern_1.png" alt="" /></p>
<p><img src="/images/truchet_pattern_3.png" alt="" /></p>
<p>I took it a step further, and while still limiting the available tiles and placing them diagonally, I allowed the rotation to vary. These might be some of my favorite results: they’re not so random as to be without structure, but they feel more natural:</p>
<p><img src="/images/truchet_pattern_2.png" alt="" /></p>
<p>There’s definitely more work I could do with utilizing the smaller subtiling and a larger number of tiles, but I’ll leave that for future work. If you’re interested in seeing more of these patterns, you can <a href="https://mwburke.github.io/generative-art/posts/032.html">do so here</a> and create as many as you want!</p>
<p>I hope to do some more work on hexagonal tiling with connections based on node-based growth algorithms, which I think have a lot of potential for walking the line between structure and chaos.</p>
<h3 id="resources">Resources:</h3>
<ul>
<li><a href="https://christophercarlson.com/portfolio/multi-scale-truchet-patterns/">Multi-Scale Truchet Patterns - Christopher Carlson</a></li>
</ul>Matthew Burkematthew.wesley.burke@gmail.comInspiration from Italian MosaicsMulti-Armed Bandits Exploration2019-06-18T00:00:00+00:002019-06-18T00:00:00+00:00https://mwburke.github.io/data%20science/2019/06/18/bandits-exploration<h2 id="multi-armed-bandit-overview">Multi-Armed Bandit Overview</h2>
<p>A multi-armed what?? If you don’t know what the multi-armed bandit problem is, then you may be confused. I’m assuming that you have some background on this for the rest of the post, but if you don’t, here’s a quick rundown:</p>
<p>Pretend you’re someone looking to go gambling, and the machine you can play is an old-style slot machine (aka a bandit; don’t worry about why) with multiple arms to choose from. Your goal is (obviously) to make the most money from putting coins into it and pulling the arms. However, given that you can only pull one arm at a time, how do you find the arm(s) that give you the most bang for your buck without wasting time on arms that just eat your money?</p>
<p>That’s essentially what the multi-armed bandit problem is. How do we maximize rewards by <em>exploring</em> new arms we don’t know much about (have only played zero or a few times), while still <em>exploiting</em> (or taking advantage of) the arms we already know give us good rewards?</p>
<p>Alright, now that we’ve covered that, we can jump into some code and ways I explored common algorithms used to maximize profits in this scenario.</p>
<h2 id="bandit-definitions">Bandit Definitions</h2>
<p>But first, let’s look again at how the bandits themselves are defined. I played around with two types:</p>
<ol>
<li><strong>Bernoulli Bandit</strong>: each arm in the bandit has a set probability each time it’s pulled of returning a reward of 1 or 0</li>
<li><strong>Gaussian Bandit</strong>: each arm in the bandit has a mean and standard deviation that define a gaussian distribution. When pulled, it samples from that distribution to return a reward.</li>
</ol>
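<p>As a rough sketch (the notebook’s actual implementation may differ, and the arm parameters here are arbitrary), the two bandit types fit in a few lines of Python:</p>

```python
import random

class BernoulliBandit:
    """Each arm pays 1 with a fixed probability, otherwise 0."""
    def __init__(self, probs):
        self.probs = probs  # one success probability per arm

    def pull(self, arm):
        return 1 if random.random() < self.probs[arm] else 0

class GaussianBandit:
    """Each arm samples its reward from a fixed normal distribution."""
    def __init__(self, means, stds):
        self.means = means
        self.stds = stds

    def pull(self, arm):
        return random.gauss(self.means[arm], self.stds[arm])

bandit = BernoulliBandit([0.2, 0.5, 0.8])
reward = bandit.pull(2)  # 1 with 80% probability, else 0
```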
<p>Here’s a quick visualization of the arm means, with bands one standard deviation away from those means, to give an idea of the potential overlap in rewards you may get from the gaussian bandits. The x axis is the arm number, and the y axis is the reward distribution.</p>
<p><img src="/images/gaussian_rewards.png" alt="" /></p>
<p>Clearly arms 3-4 are the highest ones, but their rewards still overlap greatly with arm 2’s, and it would be tricky to tell which one is best, given the amount of noise when sampling.</p>
<h2 id="execution">Execution</h2>
<p>The methods to choose arms in a programmatic way could be called methods or algorithms or whatever, but since I’ve been exploring reinforcement learning recently, I’m going to call them agents.</p>
<p>At each timestep a few things happen:</p>
<ol>
<li>The agent evaluates its current stored information and chooses an arm to interact with</li>
<li>The agent pulls the chosen arm and receives a reward in return</li>
<li>The agent makes updates to its stored information based on the reward</li>
</ol>
<p>The agents differ mainly in step 1, in how they choose the arm. Step 3 supports step 1 by updating the stored information, and is similar across most agents with some minor differences.</p>
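<p>The three steps above can be sketched as a generic interaction loop. The <code>FixedBandit</code> and <code>GreedyAgent</code> classes here are toy stand-ins I made up for illustration, not the repo’s implementations:</p>

```python
class FixedBandit:
    """Toy bandit that pays a fixed, deterministic reward per arm."""
    def __init__(self, rewards):
        self.rewards = rewards

    def pull(self, arm):
        return self.rewards[arm]

class GreedyAgent:
    """Toy agent that always exploits the best average reward so far."""
    def __init__(self, n_arms):
        self.counts = [0] * n_arms
        self.values = [0.0] * n_arms  # running average reward per arm

    def select_arm(self):
        return max(range(len(self.values)), key=lambda a: self.values[a])

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]

def run(agent, bandit, n_steps):
    """Generic loop: choose, pull, update; returns the reward history."""
    rewards = []
    for _ in range(n_steps):
        arm = agent.select_arm()   # 1. choose an arm from stored information
        reward = bandit.pull(arm)  # 2. pull it and receive a reward
        agent.update(arm, reward)  # 3. fold the reward into stored information
        rewards.append(reward)
    return rewards

history = run(GreedyAgent(2), FixedBandit([1.0, 0.0]), 100)
```

Swapping in a different <code>select_arm</code> implementation is all it takes to get the agents discussed below.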
<h3 id="evaluation-procedure">Evaluation Procedure</h3>
<p>In the following section, I compare agents with different parameters to each other by running an agent against a bandit for a pre-defined number of timesteps repeatedly. By doing this multiple times and tracking the rewards at each timestep, we can get a sense of what average performance we can expect from the agent at each timestep.</p>
<p>Naturally, we should see lower average rewards earlier on since we are still exploring and are uncertain of which arms provide the best value, but what we hope to see is a gradual increase in rewards until we identify the optimal arm, at which point the rewards should flatten out to the average of the optimal arm’s reward.</p>
<p>The two plots I include each with the comparisons track both of the metrics over time:</p>
<ol>
<li>Average reward at each timestep</li>
<li>Percent of times the agent chose the optimal arm at that timestep</li>
</ol>
<p>As you will see, the former can be a rather noisy chart (especially with gaussian reward functions), but the latter results in a smoother chart.</p>
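<p>Computing those two metrics from recorded run histories can be sketched as follows (the history format is my own assumption, not the repo’s actual interface):</p>

```python
def summarize(runs, best_arm):
    """Per-timestep average reward and optimal-arm rate over repeated runs.

    runs: one history per run, each a list of (chosen_arm, reward) pairs.
    """
    n_runs, n_steps = len(runs), len(runs[0])
    avg_reward = [sum(run[t][1] for run in runs) / n_runs for t in range(n_steps)]
    pct_optimal = [sum(run[t][0] == best_arm for run in runs) / n_runs for t in range(n_steps)]
    return avg_reward, pct_optimal
```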
<h2 id="agents">Agents</h2>
<h3 id="epsilon-greedy">Epsilon Greedy</h3>
<p>The epsilon greedy agent is defined by two parameters: epsilon and epsilon decay.</p>
<p>Every timestep, in order to select the arm to choose, the agent generates a random number between 0 and 1. If the value is below epsilon, then the agent selects a random arm. Otherwise, it chooses the arm with the highest average reward (breaking ties randomly), thus exploiting what it knows.</p>
<p>A higher epsilon results in more exploration (random arm selections), and a lower epsilon results in more exploitation.</p>
<p>Because we may not want to keep the same epsilon over the life of our problem, we introduce the epsilon decay parameter, which decreases the value of epsilon after each timestep. This naturally lends itself towards a high explore approach at the beginning when we are unsure of the arm rewards, and a high exploit approach later on once we have more information.</p>
<p>In theory, this seems like a good idea, but in practice (with noisy rewards), decaying epsilon seems to have slightly lower performance. However, I did not implement a minimum epsilon, which could help by preventing a fully-exploit scenario.</p>
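<p>The selection rule is only a few lines; a minimal sketch, assuming <code>values</code> holds the running average reward per arm:</p>

```python
import random

def select_arm_eps_greedy(values, epsilon):
    """With probability epsilon pick a random arm, else exploit the best average."""
    if random.random() < epsilon:
        return random.randrange(len(values))  # explore
    best = max(values)
    # exploit, breaking ties randomly
    return random.choice([a for a, v in enumerate(values) if v == best])

# epsilon decay: shrink epsilon multiplicatively after every timestep
epsilon, decay = 0.10, 0.9999
epsilon *= decay  # one timestep of decay
```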
<p>Below is a comparison of some different parameters of epsilon greedy agents:</p>
<p><img src="/images/eps_greedy_rewards.png" alt="" /></p>
<p><img src="/images/eps_greedy_optimal_arms.png" alt="" /></p>
<p>Here is a comparison of the best decay rate I found (ratio of 0.9999 per timestep) with different starting epsilon values.</p>
<p><img src="/images/eps_greedy_decay.png" alt="" /></p>
<h3 id="ucb">UCB</h3>
<p>The upper confidence bound (UCB) agent tracks the average reward for each arm, similar to epsilon greedy, but rather than encoding its exploration as a binary random chance, it attempts to measure uncertainty in terms of how long it has been since an arm was last chosen.</p>
<p>Each timestep, the agent chooses the arm with the highest average reward plus “uncertainty”, and the uncertainty for each arm not chosen increases a little bit.</p>
<p>Early on, every timestep where an arm is not chosen increases its uncertainty by a significant amount. As the system time grows, the uncertainty contributed by each timestep decreases, since we should have more accurate estimates of the rewards as time progresses.</p>
<p>An important note is that this uncertainty is not what we normally think of in statistics and is <strong>not related to the variance of the reward estimates</strong>.</p>
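<p>A common way to write this rule (the UCB1-style bonus; the exact parameterization in my repo may differ) scores each arm as its average reward plus <code>c * sqrt(ln(t) / n_pulls)</code>:</p>

```python
import math

def select_arm_ucb(values, counts, t, c):
    """Pick the arm with the highest average reward plus exploration bonus.

    The bonus grows with total time t and shrinks with an arm's pull count,
    so long-neglected arms eventually float back to the top.
    """
    def score(arm):
        if counts[arm] == 0:
            return float("inf")  # try every arm at least once
        return values[arm] + c * math.sqrt(math.log(t) / counts[arm])
    return max(range(len(values)), key=score)
```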
<p>The influence of the uncertainty factor is determined by a parameter C. Below is a comparison of some runs with different values of C:</p>
<p><img src="/images/ucb_pct_arms.png" alt="" /></p>
<p>One of the main purposes of this repo was to help visualize the UCB agent, in terms of how it balances the average rewards received so far and the uncertainty of unused arms.</p>
<p>Below is a gif of a UCB agent in action. Each frame in the gif is a step where the agent chose an action, received a reward, and updated its estimates/uncertainties for each arm.</p>
<p>The blue parts of each bar are the average rewards for that arm, and the orange parts are the uncertainty. You should be able to see the blue parts jump around as the highest total blue + orange arm is pulled, while the non-pulled arms’ orange parts should steadily increase until they become the highest bars.</p>
<p>At first, the values will most likely jump around more as the variance of the reward estimates is large, but as it progresses, it should settle into selecting a few arms repeatedly until there is one main winner.</p>
<p><img src="/images/ucb_race_gif.gif" alt="" /></p>
<h3 id="gradient-method">Gradient Method</h3>
<p>The prior two algorithms choose arms based on the average reward values, selecting the highest performing one (with some initial exploration). Gradient-based algorithms instead rely on relative preferences for each arm that do not necessarily correspond to actual reward values. At each timestep, the reward for an arm is observed, and then an incremental update to the existing preference score is made based on the new reward and a parameter alpha. This is similar to stochastic gradient ascent, and a larger alpha will result in a larger step size.</p>
<p>The details for updating the preference values \(H_{t}(a)\), given selection probabilities \(\pi_{t}(a)\), selected action \(A_{t}\), reward \(R_{t}\), and average reward \(\overline{R_{t}}\), are as follows:</p>
<p>\(H_{t+1}(A_{t}) = H_{t}(A_{t}) + \alpha (R_{t} - \overline{R_{t}})(1 - \pi_{t}(A_{t}))\) for action \(A_{t}\) and</p>
<p>\(H_{t+1}(a) = H_{t}(a) - \alpha (R_{t} - \overline{R_{t}})\pi_{t}(a)\) for other actions \(a \neq A_{t}\)</p>
<p>When choosing an arm, the agent passes these arm preferences through the softmax distribution to assign weights to all arms that add up to one. These weights are the probabilities that each arm is chosen. After each step, the average rewards are updated, then the weights for sampling are recalculated.</p>
<p>In case you aren’t familiar, the softmax distribution is as follows: \(P\{A_{t} = a\} = \frac{e^{H_{t}(a)}}{\sum_{b=1}^k e^{H_{t}(b)}}\)</p>
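<p>Both updates and the softmax can be sketched directly from the formulas above (variable names are my own):</p>

```python
import math

def softmax(prefs):
    """Turn preferences H(a) into selection probabilities pi(a)."""
    shifted = [h - max(prefs) for h in prefs]  # shift for numerical stability
    exps = [math.exp(h) for h in shifted]
    total = sum(exps)
    return [e / total for e in exps]

def update_preferences(prefs, chosen, reward, baseline, alpha):
    """One gradient-style preference update after observing a reward."""
    pi = softmax(prefs)
    return [
        h + alpha * (reward - baseline) * ((1.0 if a == chosen else 0.0) - pi[a])
        for a, h in enumerate(prefs)
    ]
```

The indicator term collapses the two update equations into one: the chosen action gets the \((1 - \pi)\) factor, all others get \((-\pi)\).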
<p>One thing to note is that the initial value of the average reward baseline, before any rewards have been observed, affects the results. Starting it at a value greater than zero still gives all arms an equal chance of being selected at first, but encourages more exploration in the short term before the selection probabilities of poorly performing arms are driven almost to zero.</p>
<p><img src="/images/gradient_pct_arms.png" alt="" /></p>
<h2 id="interactive-notebook">Interactive Notebook</h2>
<p>I created a <a href="https://github.com/mwburke/bandits">github repo</a> with all of the code used to generate these plots, with a <a href="https://github.com/mwburke/bandits/blob/master/walkthrough.ipynb">notebook</a> ready to re-run them and change any parameters, so you can get an intuition about how some of these common agent algorithms work.</p>
<p>I’d highly recommend playing around with different numbers of arms, bernoulli rewards, and various levels of noise in the gaussian rewards by increasing and decreasing the standard deviation compared to the means.</p>Matthew Burkematthew.wesley.burke@gmail.comBenchmark Comparisons and UCB VisualizationIntroduction to Idyll2018-12-04T00:00:00+00:002018-12-04T00:00:00+00:00https://mwburke.github.io/data%20visualization/2018/12/04/idyll-pumpkin-taste-test<p>Over Thanksgiving, some friends of mine set out to find the best pumpkin pie recipe and in the process, baked 5 different pies for comparison. After enjoying and ranking them, they decided to open the survey population to let others determine what the truly best pie was with a blind taste test. Being a data nerd himself, my friend tracked all of these responses and passed them on to me so that I could take a stab at visualizing them with a new interactive data visualization framework I had recently discovered.</p>
<h1 id="idyll"><a href="https://idyll-lang.org/">Idyll</a></h1>
<p><a href="https://idyll-lang.org/">Idyll</a> is, according to their website, “a toolkit for creating data-driven stories and explorable explanations” that makes it simple and quick to create interactive visualizations, and in my opinion, it’s the easiest tool out there to get involved with the communication medium of <a href="https://pudding.cool/process/responsive-scrollytelling/">“scrollytelling”</a>. The base <code class="language-plaintext highlighter-rouge">.idyll</code> file that renders into the final webpage is based on Markdown, but it has a few features that make it an extremely effective tool to prototype quickly but still support more advanced work.</p>
<h2 id="react-integration">React Integration</h2>
<p>One of the most powerful aspects is that it is integrated with React to enable the easy inclusion of pre-made components. It natively has support for a set of simple graphs generated from csv or json files. I wasn’t able to generate what I wanted with these, so I went ahead and added <a href="https://vega.github.io/vega-lite/">vega-lite</a> through npm and within a few minutes had a new chart from my existing data source.</p>
<p><img src="/images/idyll_intro_votes.png" alt="" /></p>
<p>Additionally, it’s fairly straightforward to take existing d3 visualizations, make a few minor modifications, wrap them in a React component and then embed them on your page. In my test, I included a <a href="https://en.wikipedia.org/wiki/Parallel_coordinates">parallel coordinates chart</a> taken directly from <a href="https://beta.observablehq.com/@jerdak/parallel-coordinates-d3-v4">an Observable notebook</a>, changed a few lines of CSS and had a working chart much faster than I expected.</p>
<p><img src="/images/idyll_parallel_coordinates.png" alt="" /></p>
<h2 id="property-management">Property Management</h2>
<p>The other fantastic feature of Idyll is the ability to create and manage variables whose properties you can both access in your different components as well as recalculate in real time based on user input.</p>
<p>For example, I can have a variable that can be modified from a variety of pre-made sources, including a button, slider, text input or scroll trigger, which can in turn update any visualizations on the page with the new properties. I don’t have to write any additional event listeners and can reuse these properties wherever I want to on the page. I didn’t leverage a ton of these features other than reusing some of my data files and visualization configuration parameters such as width/height, but the possibilities really are endless.</p>
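<p>A minimal sketch of what that looks like in Idyll’s markup, from memory of the docs (<code>var</code>, <code>Range</code> and <code>Display</code> are built-ins; feeding the variable into a chart component is illustrative):</p>

```
[var name:"chartWidth" value:300 /]

[Range value:chartWidth min:100 max:600 /]
The chart is currently [Display value:chartWidth /] pixels wide.
```

Dragging the slider updates <code>chartWidth</code>, and every component that reads it re-renders automatically.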
<h1 id="getting-started-with-idyll">Getting Started with Idyll</h1>
<p>If you would also like to get started with Idyll for your own projects, you can take a look at the <a href="/idyll-test-pumpkin/">full post I created with Idyll</a> and <a href="https://github.com/mwburke/idyll-test-pumpkin">the underlying code</a> to see how it was generated, and then head on over to Idyll’s <a href="https://idyll-lang.org/gallery">Example Gallery</a> page to see amazing work on how far you can take this framework.</p>
<h2 id="whats-the-catch">What’s the Catch?</h2>
<p>Tools always have tradeoffs, and Idyll embraces a markdown-like language that allows quick development. For more advanced visualizations and custom triggers, it may be worth choosing a more flexible but time-consuming framework to get the exact effects you want.</p>
<p>Additionally, the work only supports single-post rendering as of now, and the user has to create their own process for hosting multiple posts on a single website/platform. There are a few options out there trying to deal with this, but <a href="https://github.com/idyll-lang/idyll/issues/421">according to this github issue</a>, it looks like they’re beginning development to support this.</p>
<h2 id="further-reading">Further Reading</h2>
<p>Here are some great sites to understand the potential of what can really be done with interactive visualization for storytelling and data communication.</p>
<ul>
<li><a href="https://pudding.cool/">The Pudding</a></li>
<li><a href="https://fivethirtyeight.com/">FiveThirtyEight</a></li>
<li><a href="https://www.informationisbeautifulawards.com/news/118-the-nyt-s-best-data-visualizations-of-the-year">The NY Times</a></li>
</ul>Matthew Burkematthew.wesley.burke@gmail.comVisualizing Pumpkin Pie Taste Test ResultsProbability Calibration2018-11-26T00:00:00+00:002018-11-26T00:00:00+00:00https://mwburke.github.io/data%20science/2018/11/26/probability-calibration<h1 id="predictions-as-confidence">Predictions As Confidence</h1>
<p>As you may already know, classification problems in machine learning commonly (though not always) use algorithms that output a <em>predicted probability</em> value that can be used to gauge confidence in how sure your model is that the input belongs to one particular class.</p>
<h1 id="setting-probability-thresholds">Setting Probability Thresholds</h1>
<p>In introductory ML courses, a default value of 0.50 is usually used as the prediction cutoff for making the decision to consider a binary classification output as either positive or negative class, but in industry, selecting the right cutoff threshold is critical to making good business decisions.</p>
<p>If the cost associated with false negatives is large, it may be optimal to use a lower probability decision threshold to capture more positive users at the expense of including more false positives, and vice versa; the data scientist works with the business units to balance this tradeoff in order to minimize cost or maximize benefit. Ultimately, for applications that require binary classifications as the final output, this causes the predictions to act more as a ranking system than as values to be used directly.</p>
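<p>As a sketch of that tradeoff (the costs here are made up), one can scan candidate cutoffs and pick the one with the lowest total cost:</p>

```python
def best_threshold(probs, labels, cost_fp, cost_fn):
    """Return the cutoff that minimizes total false positive/negative cost."""
    candidates = sorted(set(probs)) + [1.01]  # 1.01 means "predict nobody positive"
    def cost(thresh):
        fp = sum(1 for p, y in zip(probs, labels) if p >= thresh and y == 0)
        fn = sum(1 for p, y in zip(probs, labels) if p < thresh and y == 1)
        return fp * cost_fp + fn * cost_fn
    return min(candidates, key=cost)

# raising cost_fn relative to cost_fp pushes the chosen threshold lower
cutoff = best_threshold([0.1, 0.4, 0.6, 0.9], [0, 0, 1, 1], cost_fp=1.0, cost_fn=5.0)
```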
<h1 id="interpretation-problems">Interpretation Problems</h1>
<h2 id="model-as-ranking">Model As Ranking</h2>
<p>This model-as-ranking system works fine in many situations, but what happens when your predicted probability does not actually represent the true probability, yet the business unit consuming your predictions assumes that it does? An example of this might be the likelihood of conversion for a given user, which is then multiplied by potential LTV to prioritize leads for a sales organization based on expected ROI. If a model tends to over/underestimate probabilities at the lower/upper ends of the predicted probability spectrum respectively (as random forest models have been known to do), you can end up spending effort on individuals who are less worth the team’s time, wasting resources and potentially losing revenue.</p>
<p>Scikit-learn has a great overview of some common algorithms that produce biased predicted probabilities. I’ve taken the liberty of displaying the chart from that overview here. Visit <a href="https://scikit-learn.org/stable/auto_examples/calibration/plot_compare_calibration.html">this link</a> to get the full code used to generate the plot, or just look at the documentation for the <a href="https://scikit-learn.org/stable/modules/generated/sklearn.calibration.calibration_curve.html">sklearn.calibration.calibration_curve</a> function.</p>
<p><img src="/images/calibration_curve_1.png" alt="https://scikit-learn.org/stable/auto_examples/calibration/plot_compare_calibration.html" /></p>
<h2 id="parallel-model-consumption">Parallel Model Consumption</h2>
<p>Additionally, models can be used in conjunction with one another to provide targets in context. Going back to our expected LTV example, a business may have separate conversion likelihood models for different segments of their customer population, with every user being assigned a conversion probability from a single model. If not all models produce well-calibrated predicted probabilities, one could end up dominating the others while still having good metrics when considered individually.</p>
<h3 id="auroc-can-be-misleading">AUROC Can Be Misleading</h3>
<p>One common performance metric that is used to measure the effectiveness of the model across the range of predicted probabilities is the area under the receiver operating characteristic (ROC) curve. In case you aren’t familiar with the ROC curve, it is a plot of the model’s true positive rate vs the false positive rate as the decision threshold is varied from 0 to 1, and as such, it is considered a more robust metric than accuracy alone in cases where classes are imbalanced or the costs of true/false positives are not yet known.</p>
<p>While it is a good metric, it is <strong>not</strong> sensitive to the absolute value of the predicted probabilities, only to the relative ordering of predictions. If all of the predicted probabilities are multiplied by a constant, the AUROC does not change, which may mislead the modeler into believing their probabilities are safe to use when, in fact, they consistently over/underestimate the true values.</p>
<p>For example, the three predicted probability density distributions below are just scaled versions of the output from the same model. Their distributions are obviously very different from one another, but because they are scaled by a constant, they all have an equivalent AUROC score.</p>
<p><img src="/images/pred_probs_scaled.png" alt="" /></p>
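<p>This invariance is easy to verify with a small rank-based AUROC implementation (a sketch, not the code used for the plot above):</p>

```python
def auroc(scores, labels):
    """AUROC as P(score_pos > score_neg), counting ties as half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

probs = [0.1, 0.3, 0.6, 0.8]
labels = [0, 0, 1, 1]
scaled = [0.5 * p for p in probs]  # same ranking, very different calibration
assert auroc(probs, labels) == auroc(scaled, labels)
```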
<h1 id="validation-with-additional-scoring-methods">Validation with Additional Scoring Methods</h1>
<p>As with most modeling, it’s impossible to represent overall performance with a single number, and if you have concerns about validating probability calibration, it seems wise to include additional scores alongside AUROC that are more representative of actual differences in calibration such as log loss or the brier score.</p>
<p>Log loss is a common loss function, but <a href="https://en.wikipedia.org/wiki/Brier_score">brier score</a> was new to me, and according to wikipedia “can be thought of as… a measure of the ‘calibration’ of a set of probabilistic predictions”. It essentially is the average squared difference between the probability that was forecast and the actual outcome of the event. This makes its interpretation analogous to the RMSE for regression problems, and does take into account the scale of the predictions.</p>
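<p>Since the brier score is just a mean squared difference, it fits in one line (binary outcomes assumed):</p>

```python
def brier_score(probs, outcomes):
    """Mean squared difference between forecast probabilities and outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)

brier_score([1.0, 0.0], [1, 0])  # perfect, confident forecasts score 0.0
brier_score([0.5, 0.5], [1, 0])  # hedged forecasts score 0.25
```

Unlike AUROC, multiplying the probabilities by a constant changes this score, which is exactly why it is useful for checking calibration.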
<h1 id="sampling-bias">Sampling Bias</h1>
<p>Many problems have imbalanced datasets in terms of the target variable, with a significant portion of the records belonging to one class. Various techniques have been developed to counteract this problem, including oversampling the minority class, downsampling the majority class, and generating synthetic samples from the minority class to more closely achieve class parity. However, these techniques can result in increased AUROC scores while biasing the predicted probabilities to be less calibrated to actual outcomes.</p>
<p>Here is an example of how a generally well calibrated classifier (Logistic Regression) can be biased depending upon the ratio of the positive to negative class in the training dataset:</p>
<p><img src="/images/calibration_curve_2.png" alt="" /></p>
<h2 id="further-research">Further Research</h2>
<p>Scikit-learn has implemented the <a href="https://scikit-learn.org/stable/modules/generated/sklearn.calibration.CalibratedClassifierCV.html">CalibratedClassifierCV</a> class to adjust your classifiers to be more calibrated either during training, or to adjust the predictions by calibrating the classifier post-training.</p>
<p>It has two options for doing so:</p>
<ul>
<li><a href="https://en.wikipedia.org/wiki/Platt_scaling">Platt Scaling</a></li>
<li><a href="https://en.wikipedia.org/wiki/Isotonic_regression">Isotonic Regression</a></li>
</ul>Matthew Burkematthew.wesley.burke@gmail.comPredictions As Actual ProbabilitiesMy Intro to Generative Art2018-07-09T00:00:00+00:002018-07-09T00:00:00+00:00https://mwburke.github.io/generative%20art/2018/07/09/generative-art-p5js<h1 id="what-is-generative-art">What is generative art?</h1>
<p>Generative art is procedurally generated art for those of us who are less traditionally artistically inclined. More specifically, those who have no skill but still have enough appreciation for art and mathematical principles to automate the creation of things that look nice.</p>
<h1 id="javascript-libraries">Javascript Libraries</h1>
<p>The go-to library for web-based mathematical visualization is <a href="https://d3js.org">d3</a>, and many visualization libraries are based upon it. However, recently I stumbled across <a href="https://processing.org/">processing</a> and its JavaScript equivalent <a href="https://p5js.org/">p5js</a>, which are amazing for creating procedurally generated visualizations. It’s inherently built to support an initialization process with the <code class="language-plaintext highlighter-rouge">setup</code> function and a function to update the visualization frame-by-frame with the <code class="language-plaintext highlighter-rouge">draw</code> function.</p>
<p>It’s easy to pick up and create something really quickly; it’s only been a few days since I first heard about it, and I’ve already had tons of fun learning the API and using the basics to create some “art” I’m happy with. I highly encourage you to check it out, and maybe take a look at some of the stuff I’ve made recently <a href="https://mwburke.github.io/generative-art">at my interactive generative art website</a>.</p>
<p>If you are too lazy to click on the link, here’s a few examples of the static visualizations as well as the one in the post header.</p>
<p><a href="http://worksofchart.com/generative-art/posts/002.html"><img src="/images/generative-art-2.png" alt="" /></a></p>
<p><a href="https://worksofchart.com/generative-art/posts/010.html"><img src="/images/generative-art-10.png" alt="" /></a></p>
<p><a href="https://worksofchart.com/generative-art/posts/011.html"><img src="/images/generative-art-11.png" alt="" /></a></p>Matthew Burkematthew.wesley.burke@gmail.comIn-Browser Art with p5.js