Shining a Light on Black Box Models with LIME

Machine learning models are increasingly the primary means with which we both interact with our data and draw conclusions. However, these models are typically highly complex and difficult to debug. This characteristic of machine learning models leads to them frequently referred to as “black boxes” as there is often very little transparency - or at least very little upfront - of how the input data links with the output the model produces. This is much more of a problem for more sophisticated models, as it is generally accepted that more sophisticated models are equally more intractable.

The LIME project aims to address this issue. I have only recently been introduced to LIME by the course Data Science For Business by Matt Dancho, but it has definitely piqued my interests and opened up an entire area of research that I was previously unaware of. In this post, I want to go through the main motivating factors of the project and review the theory of how it works. Although the project has evolved somewhat from its inception, as I will not be going into code at this stage, the general discussion of the post should still be valid. This is a very interesting topic to me and I will want to return to it in the future, so this post should provide a strong foundation for future blog posts I will write on this topic. The best place to find out more about this is one of the original papers “Why Should I Trust You?”: Explaining the Predictions of Any Classifier, which I’m sure we can agree is one the best named academic papers out there.

Why Clarity is Important

The basic way of interacting with a model is that we either directly or indirectly provide some input data and obtain some output. Especially in the case of the non-specialist, they may have no real understanding of either the link between the input data given to a model and the output it provides, and how the one model compares to another. As a data scientist, you may be able to draw some comparison between models during the test phase of model development such as by using an accuracy metric. However such metrics have their own problems: In general they do not capture the actual metrics of interests we want to optimise for i.e. actual business metrics, and do not, on their own at least, indicate why a model’s output may be less suitable than another, for instance a better performing model may more complicated to debug.

Taken together, the above points relate to issues of the trust placed upon the model. These trust issues relate to two key areas: (1) Can I trust the individual predictions made by the model? (2) Can I trust the behaviour of the model in production? To take this a bit further still, in a world increasingly aware of the practical implications of machine learning models, and especially so now that GDPR has come into action, we can no longer deny the ethical questions surrounding machine learning. It behoves use to understand how the model operates internally. A key step to resolving these issues is to grant a more intuitive understanding of the models behaviour.

As detailed in the original paper, experiments show how human subjects can successfully use the LIME library to choose better performing models and even go on to improve their performance. The key principle to this is how LIME generates local “explanations” of the model, which characterise sub-regions of the model’s original parameter space.

Unpacking “LIME”

The name LIME stands for, Local, Interpretable, Model-Agnostic, Explanations and can really be understood as the mission-statement of the package: The LIME algorithm wants to produce explanations of the predictions of any model (hence model-agnostic). By explanation, we mean something that provides a qualitative understanding between an instance’s components and the model’s predictions. A common way to do this in LIME is to use a bar chart to indicate the individual model components and the the degree to which they support or contradict the predicted class.

Localness has another quality associated with it: the local explanations must have fidelity to the predictions obtained from the original model. That is, the explanation should match the prediction of the original model in that local region as closely as it can. Unfortunately, this brings it into conflict with the need that these explanations be interpretable, that is, representations that are understandable to humans regardless of the underlying features. For instance, imagine a text classification task with a binary vector as output, regardless of the number of input features. By feeding more features into the model of the local explanation, we could anticipate greater accuracy with respect to the original model. However, this would add greater complexity to the final explanation as there will be that many more features contributing to the explanation.

In other words, LIME aims to programmatically find local and understandable approximations to a model, which are as faithful to the original model as possible. Such simplified models around single observations are generally referred to as “surrogate” models in the literature: surrogate, meaning simple models created to explain a more complex model.

The gold-standard for explainable models is something which is linear and monotonic. In turn these mean,

  • Linear: a model where the expected output is just a weighted sum of inputs, possibly with a constant additive term,
  • Monotonic: for instance in the case of a monotonically increasing function, the output only increases with increasing input values,

This is precisely what LIME tries to do by finding linear models to model local behaviour of the target model.

In order to find these local explanations, LIME proceeds using the following algorithm,

  1. Given an observation, permute it to create replicated feature data with slight value modifications. (This replica set is a set of instances sampled following a uniform distribution)
  2. Compute similarity distance measure between original observation and permuted observations.
  3. Apply selected machine learning model to predict outcomes of permuted data.
  4. Select m number of features to best describe predicted outcomes.
  5. Fit a simple model to the permuted data, explaining the complex model outcome with m features from the permuted data weighted by its similarity to the original observation .
  6. Use the resulting feature weights to explain local behaviour.

(The above steps are taken from the article “Visualizing ML Models with LIME” by “UC Business Analytics R Programming Guide”.)

The main benefit of this approach is its robustness: local explanations will still be locally faithful even for globally nonlinear models. The output of this algorithm is what is referred to as an “explanation” matrix, which has rows equal to the number of instances sampled, and columns for each feature. At this stage, the matrix produced is sufficient to provide local explanations in whatever form is deemed appropriate. In the next step, this matrix is used to characterise the global behaviour of a model.

Going Global

What about global understanding of the model? To achieve this, LIME picks a set of non-redundant instances derived from the instances sampled in the previous step, following an algorithm termed “submodular pick” or SP-LIME for short.

Once we have the explanation matrix from the previous step above, we need to derive the feature importance vector. The components of this vector give the global importance of the each of the features from the explanation matrix. The precise function mapping the explanation matrix to the importance vector depends on the the model under investigation, but in all cases should return higher values (meaning greater importance) for features that explain more instances i.e. features found across more instances globally.

Finally, SP-LIME only wants to find the minimal set of instances, such that there is no redundancy in the final returned set of instances. Non-redundant, meaning that the set of local explanations found should cover the maximum number of model features with little, if any, overlap amongst the features each individual local explanation relates to. The minimal set is chosen by a greedy approach that must satisfy a constraint that relates to the number of instances a human subject is willing to review.

In short, the approach taken by LIME is to provide a sufficient number of local explanations to explain the distinct behavioural regions of the model. This sacrifices the global fidelity of explanations in favour of local fidelity. As discussed in the original paper, this leads to better results in testing with both simulated and human users. In general though, while global interpretability can be approximate or based on average values, local interpretability can be more accurate than global explanations.

Closing Remarks

I want to return to LIME and the wider topic of machine learning interpretability in future blog posts, including how this works with H2O as well as being able to provide a more in-depth technical run-through of the library.

All the best,

Tom

The Why of Data-Driven Organisations

Something a bit different! My latest blog post is from the transcript for my lightning talk I will deliver at Infiniteconf 2018 in July. This is available to read now on medium. See you there!

All the best,

Tom

Data-Driven Decision-Making and CRISP-DM

A key driver of the rise of the so-called data-driven organisation is the increased awareness and use of data-driven decision-making (DDD) at all levels of the business. This is in a large part because of the increasing availability and quality of collected data, as well as increasing opportunities to make use of it. In this blog post I would like to discuss what DDD is, how DDD relates to data mining, and finally how to approach data mining projects. Much of the material for this post comes from my reading of “Data Science for Business”, by Foster Provost and Tom Fawcett.

Data-Driven Decision-Making

DDD refers to the use of data analysis to drive meaningful decision making. For example, a supermarket may be able to identify triggers that lead to changes in purchasing decisions and manage stock accordingly. Businesses following DDD principles have been found to demonstrate statistically significant improvements in their productivity (See “Data Science for Business” and references therein). However, that is not to say use of DDD should entirely preclude the use of intuition to help inform business decision: rather DDD should be another component to decision processes.

DDD itself relates not just to the use of data science and data mining techniques in isolation, but also to its automation, sometimes referred to specifically as “automated DDD”. It is with automated DDD that practices typical of data science or data mining really come into their own as distinct from other analytical techniques such as statistics, database queries etc. This is precisely because data mining allows a business to automate the search for knowledge and pattern recognition, although in reality though, there may always be an unavoidable manual aspect to this knowledge discovery process.

Previously I referred to “data science” and “data mining” separately. The distinction between the two concepts is always a bit unclear, however this separation is often useful to be able to refer to specific subtasks comprising the data mining cycle (see CRISP-DM below) as separate from the broader field of data science.

CRISP-DM

Having established the desirability of DDD, how do we achieve it? This is where we need to begin by first identifying the business problems we are facing or want to investigate, and see how to approach these using data mining.

Despite the seeming variety of business problems, they largely fit into one of a number of well known data mining tasks, such as classification, regression, and clustering. It is of course quite a skill to correctly identify precisely the kind of problems being addressed so as to decide early on what kind of approach to follow. This process of identification largely follows using a combination of understanding the kind of business problem being addressed and what data is available. Once the kind of data mining task is established, it becomes possible to approach it in a systematic way.

The high-level approach to engaging with business problems with data mining techniques is fairly well established, and is formalised by the framework known as the “Cross Industry Standard Process for Data Mining” or CRISP-DM. The following process diagram gives the relationship between the different phases of CRISP-DM. Diagram by Kenneth Jensen, distributed under a CC BY-SA 3.0 license

There is a lot of detail that can be adeed to this, but to highlight the main features,

  • CRISP-DM is cyclical in its very nature, given by the large circle bounding the diagram. This indicates an iterative approach to data projects
  • The starting point for any iteration should be with “business understanding” step
  • There are also multiple cycles within the overall CRISP-DM cycle, such as that between the tasks of “business understanding” and “data understanding”
  • A complete cycle from “deployment” to “business understanding” will usually follow from the successful completion of a project. This can follow from as new insights are generated by a previously successful model

The CRISP-DM diagram however does not perhaps do a good job of capturing the exploratory nature of data mining projects. This is much more so the case with data mining projects than typical software development projects due to the greater degree of inherent uncertainty e.g. overall project expectations may change from subtask to subtask. This may require a greater reliance on prototyping as opposed to iterative releases.

All the best,

Tom

Word Embeddings with Word2vec

If you’ve read any of my previous blog posts on information retrieval models, you should have come across a reference to a “bag-of-words” model. That is where we just consider the frequency of terms in a given document as the starting point for our more elaborate models. In many cases this is perfectly acceptable simplification and can lead the very powerful applications such as the retrieval models previously discussed. But what are we losing? We no longer no the have any sense of the structure of the original document and in particular the relationship between the words.

In this blog post we will look at one of the most popular and well-known word embedding algorithms, Word2vec. Using a word embedding such as that created by the Word2vec algorithm we can learn the semantic and syntactic relationship between words in our document or set of documents (termed “corpus”). Where I would explain these terms as,

  • Semantic: relating the meaning of the word
  • Syntactic: relating to the spelling/structure of a word e.g. plural vs singular

We will cover the fundamental concepts behind it, how it works, and some competing algorithms.

The reading for this blog post came from a combination of “Natural Language Processing in Action” and the original research papers by Tomas Mikolov et al.

Fundamentals

The Word2vec model continues an established tradition of using neural networks to establish a language model. Word2vec itself is a fairly simple recurrent neural network made of,

  • Input layer - words in corpus are passed in as one-hot encoded vectors
  • Recurrent hidden layer - this uses a sigmoid activation function, with a number of neurons equal to the number of dimensions used to encode all the words of the vocabulary
  • Output layer - this uses a sigmoid function to get a normalised probability vector from the activations of the hidden layer

The actual output value for a given run is taken to be word corresponding to the highest value in the probability vector. Once we have trained the network (see below) we can ignore the output of the network as we only care about the weights: these are what form the embeddings. In general, provided our corpus is not so specialised, we can use a pre-trained word embedding and avoid performing this step ourselves.

A big part of Word2vec is the ability to process the relationship between words, as learned by the word-embedding, using simple linear algebra - so-called “vector oriented reasoning”. Applicable to both the semantic and syntactic relationship between words, this proceeds by way of analogy,

  • Semantic: vec(king) - vec(man) + vec(woman) = vec(queen) i.e. “woman” is to “queen”, as “man” is to “king”
  • Syntactic: vec(apple) - vec(apples) ~= vec(banana) - vec(bananas) i.e. relate singular and plural forms

Where e.g. vec(king) is the word vector embedding given by dot product between input word vectors and weights matrix.

Approaches

How does Word2vec actually learn the relationships between words? This algorithm takes it that these relationships emerge from the cooccurrence of words. There are two main approaches to determine this cooccurrence: the skip-gram approach and the continuous bag of words (CBOW) approach.

Take the following sentence,

“The quick brown fox jumps over the lazy dog”

To generate training data we imagine having a small window that move across the document words. For instance, if this window is of five words width, we only consider samples of five words at a time (in the order they appear in the original document).

In the skip-gram approach, we are trying to predict the surrounding four words for a given input word - the word in the middle of our window. “Skip-gram” because we a creating n-grams that skip over words in the document e.g. want to find relationship between “The” for input “brown” ignoring “quick”. The table below demonstrates what this would look like for the example sentence above. In the table headings used, w_t refers to the input word, and e.g. w_t-2 is the word two places before the input.

Input Word w_t Expected Output w_t-2 Expected Output w_t-1 Expected Output w_t+1 Expected Output w_t+2
The     quick brown
quick   The brown fox
brown The quick fox jumps

The skip-gram approach can be viewed as a kind of “flipped” version of CBOW approach, and vice versa. Instead of trying the predict the surrounding words for a given input word, we are trying to find the target word for the set of surrounding words. This approach is termed “continuous” bag of words, as we can imagine finding a new bag of words for a given target word as we slide the window along our document. The table below demonstrates what this would look like for the example sentence above, with the same notation as the skip-gram example.

Input Word w_t-2 Input Word w_t-1 Input Word w_t+1 Input Word w_t+2 Expected Output w_t
    quick brown The
  The brown fox quick
The quick fox jumps brown

Using the network described above, we want to find the output vector of word probabilities for a given input word. This proceeds as a supervised learning task. Given the one-hot encoding of words in the corpus, each row in the weights matrix (from input to hidden layer) of our neural network is trained to represent the semantic meaning of individual words. That is, semantically similar words will have vector representations - they were originally surrounded by similar words.

When would you choose one approach over the other? The skip-gram approach can have superior performance over CBOW for a small corpus or with rare terms. This is because skip-gram generates more examples for a given word due to the network structure. On the other hand CBOW is faster to train and can produce higher accuracies for more frequent words.

All the best,

Tom

Distributed File Systems and MapReduce

This blog post discusses a solution to the problem in big data. Imagine dealing with a very large dataset, if we want to persist it in any practical way then we have two big problems to deal with due to the sheer size of the dataset. Firstly, the size of the dataset will make it impossible to store on a single machine or disk (in a general case). Secondly, the processing time will become painfully large without introducing some means to parallelise this process. Using a distributed file system (DFS) together with a MapReduce framework will address both these issues. This post will provide a high-level overview of these two technologies. In relation to DFS, we will consider the example of Google File System (GFS).

Google File System

GFS is a means of managing a distributed file system, that is, data storage across multiple machines. The core GFS architecture is based around a single centralised master node, which contains a lookup table to determine the storage location of the the individual files. The files themselves are stored on a much larger number of “Chunkservers”, so-called because they store files in multiple 64 MB chunks - with replication across the network. An application client talks directly to the master node.

MapReduce

MapReduce is a general framework for parallel programming. As above, imagine having a dataset so large that we want to avoid operating on it sequentially. To do this we want to be able to operate on multiple subsets on the dataset independently, but still be able to aggregate the separate subsets later on - afterall it’s the one dataset we’ve just split it up for our own convenience. This is precisely what “map” and “reduce” separately refer to,

  • Map: run some function over each and every element in each subset
  • Reduce: aggregate the subsets

To do this, we assume that our data is separated into key, value pairs. The map function will take in a set of key, value pairs, and return another set of key, value pairs (usually with the key being different afterwards). The reduce function can then group together elements in different subsets by matching on each unique key. Both the map and reduce functions are written by the developer, but the actual execution is left up to the framework.

The key strengths of both technologies are their generality as well as their ability to abstract away low-level details for the developer.

All the best,

Tom

Information Retrieval Models: Probabilistic Retrieval Model

In my previous blog post, I discussed vector space retrieval models, which ended with a discussion around the final ranking function. Although the ranking function is extremely powerful, many of the underlying assumptions seemed to be accepted simply because they worked well in practice: they were not founded on the theoretical model provided by the vector space model. By thinking about the ranking problem in a very different way, probabilistic retrieval models (PRMs) are able to offer a ranking function, built from a more satisfying theoretical foundation. More interestingly still, PRMs recover many of the features of VSMs.

Much like the previous blog post, most of this material came from my reading of “Text Data Management and Analysis” by ChengXiang Zhai and Sean Massung.

Set-up

In a PRM, the ranking function is given by the probability that a document, d, is relevant, R=1, to a query, q

Where R=0 corresponds to a document not relevant to q. To find the actual probability for a given query and document, we divide the number of relevant documents for a given query by the total number of documents retrieved for a query

The rank of a set of documents is then given by these calculated probabilities.

In the general case, rather than have a set of generated query and document terms (using the above equation), we approximate the probability using what is known as the query likelihood retrieval model. This can be interpreted as the probability that a user who considers document d relevant, would pose query q, given by the following equation.

Query Likelihood Retrieval Model

The relevance of a query is given by the probability given above. To calculate the probability of a query, we assume that query terms are independently sampled from a given document. This assumption of independence means that the total calculated probability for a query is the product of the portability of each query term,

for query terms i, j, in document d. This is all well and good, until we realise that the query terms might not be found in an otherwise relevant document. In other words we could end up with p(q|d)=0 even though the document is considered relevant by the user.

Instead of assuming the user chooses query terms from the document, we should assume the user chooses from a much large corpus, a “document language model”. This doesn’t change anything with our expression for the probability, p(q|d), but does fundamentally change the underlying probabilities we will use for each word: we shouldn’t find ourselves discounting otherwise relevant documents.

It’s also important at this stage to note that the actual relevance scoring function we will be working towards will be the logarithm of the query likelihood,

Document Language Model Smoothing

With the expression for the score above, we now need to estimate the document language model, that is the terms p(w|d). Importantly, we want to assign a nonzero probability to query words not found in document. This process of smoothing involves reducing the probability of seen query words and increasing the probability of unseen query words - as the sum of probabilities must equal one. To do this we say that the probability of a word in a given query is proportional to either the probability of the word in the document if it is found in the document,

or proportional to the probability of the word in a reference word collection C

Rewriting our equation for f(q,d) above (skipping a few steps), we get,

“Approximately equal”, because the above equation omits a sum over the probabilities of words not found in the document - this is irrelevant for ranking purposes. This is where things get interesting. As promised, the form of the equation gives us many of features we simply assumed for the VSM ranking function,

  • The numerator of the left hand expression effectively acts as the TF weighting
  • The denominator is the IDF weighting i.e. the more frequent the term w is in C, the more it is discounted to the final rank
  • The right hand side term is effectively equivalent to document length normalisation: the longer the document, the smaller this term as smoothing is less important.

Jelinek-Mercer smoothing

A specific example of smoothing is linear interpolation with a constant mixing coefficient or Jelinek-Mercer smoothing. In this model, we use a single coefficient to determine what non-zero probability to apply to a query word not found in a given document. Applied to all query terms, this gives p(w|d) as a weighted sum of the probability of a word in the given document and the probability of the word in corpus C.

where the mixing coefficient, lambda is a value between zero and one, inclusive.

Wrap-up

Given this expression for maximum likelihood estimate we can then substitute it into our expression for the ranking function above,

The final form of the scoring function given by Jelinek-Mercer smoothing is in fact very similar to that given by a VSM, since it is a sum of all the matched query terms. As noted above, we find many of the same features of VSMs that we get for free by virtue of the features of a PRM. An alternative to Jelinek-Mercer smoothing is Dirichlet prior or Bayesian smoothing, which uses a smoothing coefficient that depends on the size of the document in question.

All the best,

Tom

Information Retrieval Models: Vector Space Model

In this and the following blog post I want to provide a very high-level overview of how information retrieval models work. These models are designed to be able to find the best match to some query text from a corpus of text documents, by ranking each document by some quantitative measure of relevance. For instance, when I use a search engine, it will try to return documents that are considered most relevant for my current query. Common between all such information retrieval models, is that they assume a “bag-of-words” representation of text: any text sample is reducible to the set of words occurring in the text without regard to the grammar or word order. This also means that the query text can be broken down into a linear combination of individual query words.

This particular post will discuss the vector space model (VSM) framework for interpreting queries, documents, and the similarity between them. Working from a very basic understanding, we will see how we can achieve a ranking function equivalent to the state-of-art pivoted length normalisation by adding assumptions to our initial similarity function. The next post will look at probabilistic retrieval models, comparing them with VSM.

Much of this material came from my reading of “Text Data Management and Analysis” by ChengXiang Zhai and Sean Massung.

Set-up

VSM represents all the individually occurring words in our corpus as a dimension in a vector space, such that the total number of dimensions of vector space is the total size of the corpus. This allows us to represent the query of document text as a vector given by a linear sum of the of the individual words in this vector space. We only care if a word does or doesn’t appear in a query or document, so our query and document vectors only contain ones or zeros, indicating presence or absence respectively.

Given this representation, we can then determine that the similarity is given by the “closeness” between two vectors. In 2D, it is easy to show that the more similar query and document vectors have the smaller angle between them. More generally, we use the dot product operator, which becomes large and positive for near identical query and document vectors, and approaches 0 where the two vectors are completely different. The dot product is then just the sum over product,

for query and document vectors as defined above.

Term Frequency

What if we were to take into account the frequency with which a particular word occurred in a given document? This is the term frequency (TF). TF should give us a better sense of the relevance of a document in relation to a query, as it is likely a more relevant document will contain a query term more frequently. In doing this, we modify our vectors to include the frequency of each word in the vector, count(w,q), count(w,d) for the query and document vectors respectively. The equation for similarity above then becomes

where in this case x,y>=0. The rest of this blog, we will adjust count(w,d) so as to be able to produce more meaningful rankings.

Inverse Document Frequency

Term frequency alone may not give us the document ranking that we’d really want. It may turn out that a given query term is just very common, so just because a document contains lots of occurrences of this term is not a good gauge of that document’s relevance to the query. The remedy for this is to introduce inverse document frequency (IDF). This is a global statistic, which penalises words that are frequent across documents. IDF is often given as,

where M is the total number of documents in the collection, df(w) is “document frequency” the number of documents that contain the given term, w. The 1 in the numerator is just to prevent IDF(w) reducing to zero in as df(w) approached M.

TF Transformation

Similar to IDF, TF transformation penalises commonly occurring words. However, in this case, this penalty applies to words found in the target document only. As before, the presence of a given query term in a document is less relevant the more frequent it occurs in that document. This is often given by taking the logarithm of frequency with which a word query term occurs in a document. This is simply because logarithmic growth is very slow. The TF transformation used for pivoted length normalisation replaces our naive count(w,d) with TF(w,d), given by the following equation,

The most effective transformation that is commonly used is known as BM25 TF,

for some parameter k>=0. Unlike a simple logarithmic function, TF(w,d) above is bounded above by k+1. This is important to prevent any one term from dominating query results: a single term cannot spam query responses.

Document Length Normalisation

Finally, we want our similarity rankings to be able to take into account total document length. This is important to consider as a longer document is more likely to match a query - there’s simply more text that could match the query text. An effective approach is to use pivoted document length normalisation, which both penalises documents that are longer than the average document length, and rewards documents that are shorter. This variable DLN is given as,

where, |d| is the current document length, avdl is the average document length in the collection, and b a parameter between the values of zero and one, inclusive.

Wrap-up

Putting all of the above components together we get the following ranking function,

which is in fact the ranking function for the pivoted length normalisation ranking algorithm. Moving left to right we have,

  • The term frequency for the query
  • The term frequency for the document after applying the TF transformation
  • The pivoted length normalisation
  • The IDF

As as promised, we have seen step-by-step where all these components come from.

Obviously I have skipped a lot of detail for the sake of brevity, but there is still perhaps some lingering concern about some the components we’ve covered. Even though VSM gives a very robust and meaningful interpretation of terms in our corpus: they are vectors in a vector space, the other components seem to just be assumptions that just so happen to do well in application. This has been, and remains to be a problem for some researchers in this field, who continue to search for a more robust theoretical foundation of the heuristics given above. If this is you, then you’d be happy to hear that many of the features of VSM retrieval models emerge following the more mathematically robust approach taken by probabilistic retrieval models, which I’ll cover next time.

All the best,

Tom

Managing Python Virtual Environments

Having gotten use to using Anaconda and Conda to create and manage Python projects, I recently found myself in the position of wanting the same degree of dependency management for different projects without the need of installing additional software. Luckily enough, Python 3.3+ supports virtual environments out of the box albeit in a very stripped-down forn. This short post is just to demonstrate how to use this neat tool - things that I have been consistently forgetting! Some of these steps are specific to the OS and shell that I use: macOS and ZSH. YMMV.

Assuming your version of Python supports virtual environments, you can create a new virtual environment in your terminal with,

python -m venv my-virtual-environment

which will create all the dependencies for the new virtual environment in the current working directory. You can also specify a path to your new virtual environment as well. For the purpose of development, it’s easier to activate a virtual environment to save you from consistently specifying the environment’s full path during development. To activate the virtual environment use,

source my-virtual-environment/bin/activate

Within the activated virtual environment, you can use pip as required to install particular dependencies. You can then save all your project dependencies to a text file,

pip freeze > requirements.txt

and then use this to install the matching dependencies in another environment or project

pip install -r requirements.txt

Finally, leave your virtual environment with

deactivate

All the best,

Tom

TDDing a Simple KNN Classifier

In this blog I want to demonstrate how to implement a simple K Nearest Neighbour classifier from scratch using test-driven development (TDD). Although this is not something you would typically do day-to-day, it is a good way to fully understand how the classifier internals work albeit in a much more watered-down form. This blog takes a cue from the excellent Google Developers “Machine Learning Recipes” series, but goes approaches the classifier implementation using TDD as well as generalising the prediction algorithm.

Scikit-Learn Implementation

Our starting place will be the Scikit-Learn (abbreviated to sklearn) implementation of the KNN classifier,

from sklearn import datasets
from sklearn.cross_validation import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

iris = datasets.load_iris()

X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

clf = KNeighborsClassifier(n_neighbors=3)

clf.fit(X_train, y_train)

predictions = clf.predict(X_test)

print(accuracy_score(y_test, predictions))

Going from top to bottom, we,

  1. Load the iris data set from sklearn, this is a set of 150 entries with four features, each row corresponding to one of three labels
  2. Assign the input and target features to variables X and y
  3. Perform a test-train split on the data set
  4. Initialise the KNN classifier
  5. Pass in the training data. The training data forms the reference for our test data
  6. Test the classifier with the test data
  7. Print out the accuracy of the classifier

This is the basic workflow we want to retain, only we will call our own classifier rather than KNeighborsClassifier, while keeping the same API contract. The above code sample will also act as a performance benchmark, which we will return to later on.

Neighbours Benchmark Accuracy
1 90.7%
3 93.3%
5 96%
7 96%

Some Theory

Before we get to coding anything else, let’s think about how the KNN algorithm works. Put simply, we want to predict the label for a given row in our dataset. For a given row in the test data, we compare it against each row in the training data. To do this we make the assumption that the label for our given row must be the same as the label of the nearest neigbouring row in the training dataset. This can be just the single nearest neigbour or the most frequently occuring label of several nearest neigbour as specified by the n_neighbours parameter. Nearest in this case is given by some measure of the difference between individual features in each row under consideration - the smaller this difference the closer the points are, the more likely they are to have the same label. Our distance measure will be the Euclidean distance, which is just a general form of Pythagoras’ theorem:

# Each list give feature values for row
row_1 = [ 4.6,  3.1,  1.5,  0.2 ]
row_2 = [ 5.8,  2.8,  5.1,  2.4 ]

euc_distance = ( (4.6 - 5.8) ** 2 + (3.1 - 2.8) ** 2 + (1.5 - 5.1) ** 2 + (0.2 - 2.4) ** 2 ) ** 0.5

For this TDD approach we will be using pytest. To follow along create two files in the same directory, main.py and test.py. Running pytest test.py from the command line should run the test suite.

Nearest-Neighbour Implementation

In our first attempt, we will be creating a KNN classifier that only considers a single nearest neighbour, the correct label is simply the label of the single row in the training data with smallest distance to the current row in the test data under consideration. To code this implementation, we will be using the outside-in approach to TDD: we define the API interface, and code the internals from there.

Following the example of the sklearn code sample above, we want to be able to initialise the classifier with a parameter n_neighbours, which should default to one if not provided. Writing out our first test,

# test.py
def test_KNN_should_be_initialised_with_n_neighbors():
    clf = KNN()

    assert clf.n_neighbors == 1

Running this should obviously fail. In a new file main.py we want to create our new class with a constructor that satisfies the requirements of the test,

# main.py
class KNN():

    def __init__(self, n_neighbors=1):
        self.n_neighbors = n_neighbors

Then update the test file to import this file,

# test.py
from main import KNN

def test_KNN_should_be_initialised_with_n_neighbors():
    clf = KNN()

    assert clf.n_neighbors == 1

And we should have a green test.

We’ve now completed our first successful cycle of TDD, the rest is wash and repeat. Our next test should allow us to bootstrap our classifier by calling the fit() method. This allows us to pass in training data and training labels, X_train and y_train respectively. To test this, we need to specify some mock data corresponding to these values in addition to calling the classifier. The mock data could be anything though so long as it is truthy. For our purposes though, I’ve tried to mimick the structure of the iris data set of four features per input. The updated test file is below with the previous test case omitted for clarity.

# test.py

# mock data
X_train = [
    [0, 0, 0, 0]
]
y_train = [0]

# ... other code ...

def test_should_be_able_to_pass_training_data_to_classifier(n_neighbors):
    clf = KNN(n_neighbors)

    clf.fit(X_train, y_train)

    assert clf.X_train == X_train
    assert clf.y_train == y_train

The fix for this is very simple, just create a fit() method for our class,

# main.py
class KNN():

    # ... other code ...

    def fit(self, X_train, y_train):
        self.X_train = X_train
        self.y_train = y_train

We get to green once again. Now we come to actually predicting the label. Starting more simply this time, let’s write out a single test case following the API above,

#test.py
X_test = [[0, 0, 0, 0]]
y_test = 0

def test_predict_should_return_label_for_test_data():
    clf = KNN()

    clf.fit(X_train, y_train) # taken from the previous mock data

    predictions = clf.predict(X_test)

    assert predictions == y_test

In addition to the mock data previously defined, we have the test data X_test and y_test, which are the test inputs and the correct label respectively. The simple fix is to define predict() and return a hardcoded value

# main.py
class KNN():

    # ... other code ...

    def predict(self, X_test):
        return 0

This will make the current test green, but won’t help if the test data changes, which is precisely what we’ll do

#test.py
X_train = [
    [0, 0, 0, 0],
    [1, 1, 1, 1],
    [1, 1, 1, 1],
    [2, 2, 2, 2],
    [2, 2, 2, 2],
    [2, 2, 2, 2],
    [2, 2, 2, 2]
]
y_train = [0, 1, 1, 2, 2, 2, 2]

# ... other code ...

I’ll come to explain the precise arrangement later on, but now we have multiple contending rows. The algorithm for prediction is as follows, for a given X_test value,

  • Initialise variables to keep track of the smallest distance and corresponding index for the row in the training data
  • Then iterate through the training data
  • For a given row in training data find the distance between that row and X_test
  • If the newly calculated distance is smaller than current smallest distance then update both the smallest distance and the corresponding index
  • Having gone through all the training data, return the label corresponding to the closest index

The final piece of the puzzle is how we will calculate the distance between two rows, which as we said above with be given by the Euclidean distance. Rather than do this from scratch we will import the distance module from the SciPy library. Putting this altogether we should get the following.

# main.py
from scipy.spatial import distance

class KNN():

    # ... other code ...

    def predict(self, X_test):
        closest_distance = distance.euclidean(X_test, self.X_train[0])
        closest_index = 0
        for i in range(1, len(self.X_train)):
            dist = distance.euclidean(X_test, self.X_train[i])
            if dist < closest_distance:
                closest_distance = dist
                closest_index = i
        return self.y_train[closest_index]
        

The final step to make is to consider the case where we have multiple rows in X_test - we need to call the above method for each row. I’m going to skip through the steps to do this, but the key word is ‘refactor’: I’ve moved the previous implementation into its own method __closest. predict now iteratively calls this new method and returns an array of predictions corresponding to each row in X_test. To TDD this, simply extend X_test and y_test.

# main.py
class KNN():

    # ... other code ...

    def predict(self, X_test):
        predictions = []
        for row in X_test:
            prediction = self.__closest(row)
            predictions.append(prediction)
        return predictions

    def __closest(self, row):
        closest_distance = distance.euclidean(X_test, self.X_train[0])
        closest_index = 0
        for i in range(1, len(self.X_train)):
            dist = distance.euclidean(X_test, self.X_train[i])
            if dist < closest_distance:
                closest_distance = dist
                closest_index = i
        return self.y_train[closest_index]
        

This is the complete implementation of our KNN classifier where n_neigbors is implicity set to one. The reported accuracy is 90.7% - matching the benchmark perfectly.


KNN Implementation

Now we want to generalise the classifier where n_neighbors is some odd number greater than or equal to one (odd to prevent ties). Having tests in place makes this process much easier as we can always backtrack if we break some functionality. Before going any further, we should add some test cases - we want to follow the philosophy of testing requirements not specific examples, or in other words a single test should run with a number of different parameters. Pytest terms this ‘parametrizing’ test functions:

# test.py
# ... other code ...

X_test = [[0, 0, 0, 0]]
@pytest.mark.parametrize(('n_neighbors'),[1,3,5])
def test_KNN_should_be_initialised_with_n_neighbors(n_neighbors):
    clf = KNN(n_neighbors)

    assert clf.n_neighbors == n_neighbors

The parameters used are completely arbitrary, but will form the basis of a comparison with the benchmark later on. Running this now, we should see three separate tests execute. The same change should also be made to the fit() test. Updating the predict() test is a little different, because we want to update this function’s internals as well. So far we have only considered taking the label of the single nearest row in the training data as the correct label - but what if we were to take the most frequently occuring label from a set of training data rows? We should see an increase in the reported accuracy.

The algorithm will be as follows,

  1. For a given row in the test data, find all the distances from the rows in the training data following our distance measure
  2. For this distances, find the k smallest values, where k is given by n_neighbors
  3. For these k values, return the most common, this is the predicted label

Before we get to the implementation, we need to set up the tests. We need to parameterise both n_neighbors and y_test - the predicted label. Why’s that? Well first, think about what the new algoritm does: it returns the most commonly occuring value from a set of values. What might not be clear is what that means for us given that I want to keep X_train and y_train unchanged, which are repeated below.

X_train = [
    [0, 0, 0, 0],
    [1, 1, 1, 1],
    [1, 1, 1, 1],
    [2, 2, 2, 2],
    [2, 2, 2, 2],
    [2, 2, 2, 2],
    [2, 2, 2, 2]
]
y_train = [0, 1, 1, 2, 2, 2, 2]

If we take the k smallest distances between X_test and each of the rows in X_train, then we might find ourselves with a tie - something I don’t want to consider for the sake of simplicity. Considering odd values for n_neighbors only for X_test = [[0, 0, 0, 0]], then if n_neighbors = 1, our predicted label should be 0 as this is the label of the single row with the smallest distance from X_test. For n_neighbors = 3, we should instead get a predicted label of 1. That is because although row [0, 0, 0, 0] has the single smallest distance, of the three rows with the smallest distance [[0, 0, 0, 0], [1, 1, 1, 1], [1, 1, 1, 1]], two of the three have the label 1. Moving to n_neighbors = 5 is where things get difficult, because for our X_train the five rows with smallest distances will have labels [0, 1, 1, 2, 2] leading to a tie. This is precisely what I don’t want! So if I’m committed to keeping X_train and y_train constant the best thing to do is skip n_neighbors = 5 and instead consider n_neighbors = 7. That is simply because for the given X_train and y_Train above, I can get a single most common value from the labels [0, 1, 1, 2, 2, 2, 2].

Putting the above discussion altogether, for our X_train and y_train, the predicted label, y_test will vary for different values of n_neighbors. So our updated predict() test will look like,

# test.py
@pytest.mark.parametrize(('n_neighbors', 'y_test'),[(1, [0]),(3, [1]), (7, [2])])
def test_predict_should_return_label_for_test_data(n_neighbors, y_test):
    clf = KNN(n_neighbors)

    clf.fit(X_train, y_train)

    predictions = clf.predict(X_test)

    assert predictions == y_test

Now we can break the tests! We don’t need to do anything directly to predict() as it is just the interface. To make this easier, I want to split __closest into two methods, __closest, which will now find and return a list of the distances betweet X_test and the rows in X_train, and __vote, which will gind the most common values out of the smallest n_neighbors distances.

# main.py
from scipy.spatial import distance
from collections import Counter

class KNN():

    # ... other code ...

    def __closest(self, row):
        distances = []
        for i in range(len(self.X_train)):
            dist = distance.euclidean(row, self.X_train[i])
            distances.append((self.y_train[i], dist))
        sorted_distances = sorted(distances, key=lambda x: x[1])
        return self.__vote(sorted_distances)

    def __vote(self, distances):
        labels = []
        for i in range(self.n_neighbors):
            labels.append(distances[i][0])
        return Counter(labels).most_common(1)[0][0]
        

There are few things going on the code sample worth highlighting,

  • As we want to return the label, not the distance, I chose to create a tuple for each distance and the corresponding label in y_train. These are then appended to the list
  • To make __vote simpler, we are passing a sorted list of tuples, sorting by the distance. There are many ways to do this, but I used a lambda function
  • __vote should be fairly self-explanatory. To make things simpler, I’m using a Counter object to find the most common label from the list of labels

Using a list comprehension we can make this much cleaner,

from collections import Counter

// ... more code ...

def __vote(self, distances):
    return Counter(x[0] for x in distances[:self.n_neighbors]).most_common(1)[0][0]

The full code can be found here.

Evaluation

Putting this altogether, and rerunning the first code sample only substituting the sklearn KNN classifier for our own we obtain,

Neighbours Our Accuracy Benchmark Accuracy
1 90.7% 90.7%
3 93.3% 93.3%
5 96% 96%
7 96% 96%

Which is great news: we are able to obtain the same accuracy with our simple implementation. However our implemenation is several orders of magnitude slower than sklearn’s. Interesting to note, but perhaps not suprising, our original implementation with n_neighbors hardcoded to one has a faster running time of around 80 ms.

Neighbours Our Running Time (ms) Benchmark Running Time (ms)
1 95.4 0.451
3 94.3 0.539
5 88.6 0.458
7 102 0.465

All the best,

Tom

TensorFlow at 10,000 Feet

TensorFlow is an opensource library maintained by Google that has become a particularly popular choice for deep learning. Among the reasons for its popularity,

  • Simple Python API,
  • Resource management across multiple CPUs and GPUs, or even a remote server,
  • A lot of other conveniences that make for quick prototyping and debugging.

In this post, I want to provide a very high-level overview of TensorFlow. We won’t get to code anything meaningful, but instead focus on some of the key concepts that make this library so confusing at first blush. To my mind at least, these are,

  1. What is the TensorFlow workflow?
  2. How to use constants, variables and placeholders?
  3. How to share scope?
  4. Session Management

Hopefully this post will go some way to explaining these points and more.

1. TensorFlow Workflow

Fundamental to understanding how to use the library is understanding the TensorFlow workflow, which breaks into two parts,

  • Construction phase: where we declare the values and operators
  • Execution phase: evaluate values and operators

Key to this distinction between construction and evaluation phases, is how TensorFlow thinks about your code internally. TensorFlow represents the code written in the construction phase as a dataflow graph abstraction. Without going into too much detail, this abstraction is how TensorFlow is able to execute operations in parallel across multiple CPUs and GPUs without additional difficulty to the programmer. This graph is composed of nodes and edges, where the nodes are operators and the edges are the values declared in the construction phase. All values are implicitly treated as tensors behind the scenes, and by tensor, we basically mean a nested Python list.

Aside: Bringing these concepts together we can kind of speculate the origin of the TensorFlow name, where “Tensor” refers to the basic unit of abstraction in the library and “Flow” being the flow of data through the data graph.

So up to this point, we’ve defined our values but can’t execute them - this is performed by a session. This is also where TensorFlow initialises our values defined in the construction phases as well as where it manages the hardware resources for us.

Sessions also underline the need to stress the distinction between the construction and execution phases: graphs are not stateful. That is, multiple sessions reusing the same graph will not share state. However, we can choose to checkpoint and restore our values, as we’ll see below. Before getting to the code samples, we need to introduce the different tensor types.

2. Constants, Variables and Placeholders

TensorFlow defines three main types of tensor for us: constants, placeholders, and variables. Constants are values that will not change, variables are values that can change, and placeholders are values that are assigned when we execute the graph. To make this a bit more concrete, we can relate these tensor types to a typical machine learning model. Constants for us will define hyperparameters values that are set once per model, variables will define values we want to change automatically during training like weights and biases, and placeholders are values we fed in from our dataset like our training labels.

Use of constants are the easiest to show. We simply assign constants to variable at the top, launch a session in a with statement, and evaluate the expression “a + b” using sess.run. We’ll be using a similar pattern in all the code samples in this post. By using a with statement to encapsulate our session, we ensure that session resources are freed up outside of the statement.

import tensorflow as tf

a = tf.constant(1.0)
b = tf.constant(2.0)

with tf.Session() as sess:
    sess.run(a + b)

Moving onto placeholders, these are the access point for custom values into the session. To do this we use a feed to pass placeholders into the session. The setup is very similar to constants, but instead of setting a value we just declare the variable type.

a = tf.placeholder(tf.float32)
b = tf.placeholder(tf.float32)

with tf.Session() as sess:
    sess.run(a + b, feed_dict={a: 1.0, b: 2.0})

Moving on to variables, the setup is as per constants but if we run this code sample as is,

a = tf.Variable(1.0)
b = tf.Variable(2.0)

with tf.Session() as sess:
    result = sess.run(a + b)
#=> FailedPreconditionError (see above for traceback): Attempting to use uninitialized value

We get an error. This tells us we need initialise the variable. We need to initialise variables in order to set the type and shape of the variable, which remains fixed for the graph. This is similar to how Python variables will raise a NameError if declared without a value. To fix this we call an initialiser like so. The easiest approach to take for us is to call the global initializer, which will initialise all variables at once. We can also initialise targeted variables as well as we will see in the final code sample of this blog.

a = tf.Variable(1.0)
b = tf.Variable(2.0)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(a + b)

3. Scope

Scope is important for precisely the reasons you’d expect: it allows us to namespace variables to ensure correct reuse. The code sample below gives a typical use case, the variable a is scoped to my_variables.

with tf.variable_scope('my_variables'):
  a = tf.get_variable('a')

print(a.name) #=> my_variables/a:0

The 0 in the variable name isn’t anything to worry about - it is just because of how TensorFlow handles tensors behind the scenes.

4. Session Management

So far we’ve seen how we execute a graph within a session, but we can also use a session to save and restore variables, using the Saver class. The following demonstrates how we can checkpoint a session,

saver = tf.train.Saver()

with tf.Session() as sess:
	sess.run(...)

	saver.save(sess, checkpoint)

And we can subsequently load a previous session,

saver = tf.train.Saver()

with tf.Session() as sess:
	saver.restore(sess, checkpoint)

	sess.run(...)

Session management is made use of a lot in the wild, where we can save variables after training a model, and later on load these variables before testing. Below is an example of initialising variables in combination with restoring from a previous session.

with tf.Session() as sess: 
sess.run(tf.global_variables_initializer())
saver.restore(sess, checkpoint)

All the best,

Tom