## Information Retrieval Models: Probabilistic Retrieval Model

In my previous blog post, I discussed vector space retrieval models, which ended with a discussion around the final ranking function. Although the ranking function is extremely powerful, many of the underlying assumptions seemed to be accepted simply because they worked well in practice: they were not founded on the theoretical model provided by the vector space model. By thinking about the ranking problem in a very different way, probabilistic retrieval models (PRMs) are able to offer a ranking function, built from a more satisfying theoretical foundation. More interestingly still, PRMs recover many of the features of VSMs.

Much like the previous blog post, most of this material came from my reading of “Text Data Management and Analysis” by ChengXiang Zhai and Sean Massung.

## Set-up

In a PRM, the ranking function is given by the probability that a document, d, is relevant, R=1, to a query, q

$f\left(q, d\right) = p\left(R=1|d,q\right), R\in{0,1}$

Where R=0 corresponds to a document not relevant to q. To find the actual probability for a given query and document, we divide the number of relevant documents for a given query by the total number of documents retrieved for a query

$p\left(R=1|d,q\right) = \frac{count\left(R=1|d,q\right)}{count\left(d,q\right)}$

The rank of a set of documents is then given by these calculated probabilities.

In the general case, rather than have a set of generated query and document terms (using the above equation), we approximate the probability using what is known as the query likelihood retrieval model. This can be interpreted as the probability that a user who considers document d relevant, would pose query q, given by the following equation.

$p\left(q|d,R=1\right)$

## Query Likelihood Retrieval Model

The relevance of a query is given by the probability given above. To calculate the probability of a query, we assume that query terms are independently sampled from a given document. This assumption of independence means that the total calculated probability for a query is the product of the portability of each query term,

$p\left(q|d\right) = \frac{i}{\left|d\right|}*\frac{j}{\left|d\right|}$

for query terms i, j, in document d. This is all well and good, until we realise that the query terms might not be found in an otherwise relevant document. In other words we could end up with p(q|d)=0 even though the document is considered relevant by the user.

Instead of assuming the user chooses query terms from the document, we should assume the user chooses from a much large corpus, a “document language model”. This doesn’t change anything with our expression for the probability, p(q|d), but does fundamentally change the underlying probabilities we will use for each word: we shouldn’t find ourselves discounting otherwise relevant documents.

It’s also important at this stage to note that the actual relevance scoring function we will be working towards will be the logarithm of the query likelihood,

$f(q,d)=\log p\left(q|d\right)=\sum_{i=1}^{N} \log p\left(w_i|d\right)=\sum_{w\in V} count\left(w,q\right)\log p\left(w|d\right)$

## Document Language Model Smoothing

With the expression for the score above, we now need to estimate the document language model, that is the terms p(w|d). Importantly, we want to assign a nonzero probability to query words not found in document. This process of smoothing involves reducing the probability of seen query words and increasing the probability of unseen query words - as the sum of probabilities must equal one. To do this we say that the probability of a word in a given query is proportional to either the probability of the word in the document if it is found in the document,

$p\left(w|d\right) = p_{seen}\left(w|d\right)$

or proportional to the probability of the word in a reference word collection C

$p\left(w|d\right) = \alpha_d p\left(w|C\right)$

Rewriting our equation for f(q,d) above (skipping a few steps), we get,

$\log p\left(q|d\right) \simeq \sum_{w\in d, w \in q} count\left(w,q\right)\log \frac{p_{seen}\left(w|d\right)}{\alpha_d p\left(w|C\right)}+n\log \alpha_d$

“Approximately equal”, because the above equation omits a sum over the probabilities of words not found in the document - this is irrelevant for ranking purposes. This is where things get interesting. As promised, the form of the equation gives us many of features we simply assumed for the VSM ranking function,

• The numerator of the left hand expression effectively acts as the TF weighting
• The denominator is the IDF weighting i.e. the more frequent the term w is in C, the more it is discounted to the final rank
• The right hand side term is effectively equivalent to document length normalisation: the longer the document, the smaller this term as smoothing is less important.

## Jelinek-Mercer smoothing

A specific example of smoothing is linear interpolation with a constant mixing coefficient or Jelinek-Mercer smoothing. In this model, we use a single coefficient to determine what non-zero probability to apply to a query word not found in a given document. Applied to all query terms, this gives p(w|d) as a weighted sum of the probability of a word in the given document and the probability of the word in corpus C.

$p\left(w|d\right) = \left(1-\lambda\right)\frac{count\left(w,d\right)}{\left|d\right|}+\lambda p\left(w|C\right)$

where the mixing coefficient, lambda is a value between zero and one, inclusive.

## Wrap-up

Given this expression for maximum likelihood estimate we can then substitute it into our expression for the ranking function above,

$f\left(q,d\right)=\sum_{w\in d, w \in q} count\left(w,q\right)\log \left(1+\frac{1-\lambda}{\lambda}\frac{count\left(w,d\right)}{\left|d\right|p\left(w|C\right)}\right)$

The final form of the scoring function given by Jelinek-Mercer smoothing is in fact very similar to that given by a VSM, since it is a sum of all the matched query terms. As noted above, we find many of the same features of VSMs that we get for free by virtue of the features of a PRM. An alternative to Jelinek-Mercer smoothing is Dirichlet prior or Bayesian smoothing, which uses a smoothing coefficient that depends on the size of the document in question.

All the best,

Tom

## Information Retrieval Models: Vector Space Model

In this and the following blog post I want to provide a very high-level overview of how information retrieval models work. These models are designed to be able to find the best match to some query text from a corpus of text documents, by ranking each document by some quantitative measure of relevance. For instance, when I use a search engine, it will try to return documents that are considered most relevant for my current query. Common between all such information retrieval models, is that they assume a “bag-of-words” representation of text: any text sample is reducible to the set of words occurring in the text without regard to the grammar or word order. This also means that the query text can be broken down into a linear combination of individual query words.

This particular post will discuss the vector space model (VSM) framework for interpreting queries, documents, and the similarity between them. Working from a very basic understanding, we will see how we can achieve a ranking function equivalent to the state-of-art pivoted length normalisation by adding assumptions to our initial similarity function. The next post will look at probabilistic retrieval models, comparing them with VSM.

Much of this material came from my reading of “Text Data Management and Analysis” by ChengXiang Zhai and Sean Massung.

## Set-up

VSM represents all the individually occurring words in our corpus as a dimension in a vector space, such that the total number of dimensions of vector space is the total size of the corpus. This allows us to represent the query of document text as a vector given by a linear sum of the of the individual words in this vector space. We only care if a word does or doesn’t appear in a query or document, so our query and document vectors only contain ones or zeros, indicating presence or absence respectively.

$q=\left(x_1,...,x_N\right), d=\left(y_1,...,y_N\right)$

Given this representation, we can then determine that the similarity is given by the “closeness” between two vectors. In 2D, it is easy to show that the more similar query and document vectors have the smaller angle between them. More generally, we use the dot product operator, which becomes large and positive for near identical query and document vectors, and approaches 0 where the two vectors are completely different. The dot product is then just the sum over product,

$sim(q,d)=q \cdot d=\sum_{i=1}^{N} x_iy_i$

for query and document vectors as defined above.

## Term Frequency

What if we were to take into account the frequency with which a particular word occurred in a given document? This is the term frequency (TF). TF should give us a better sense of the relevance of a document in relation to a query, as it is likely a more relevant document will contain a query term more frequently. In doing this, we modify our vectors to include the frequency of each word in the vector, count(w,q), count(w,d) for the query and document vectors respectively. The equation for similarity above then becomes

$sim(count(w,q),count(w,d))=\sum_{i=1}^{N} x_iy_i$

where in this case x,y>=0. The rest of this blog, we will adjust count(w,d) so as to be able to produce more meaningful rankings.

## Inverse Document Frequency

Term frequency alone may not give us the document ranking that we’d really want. It may turn out that a given query term is just very common, so just because a document contains lots of occurrences of this term is not a good gauge of that document’s relevance to the query. The remedy for this is to introduce inverse document frequency (IDF). This is a global statistic, which penalises words that are frequent across documents. IDF is often given as,

$IDF(w)=\log\frac{M+1}{df(w)}$

where M is the total number of documents in the collection, df(w) is “document frequency” the number of documents that contain the given term, w. The 1 in the numerator is just to prevent IDF(w) reducing to zero in as df(w) approached M.

## TF Transformation

Similar to IDF, TF transformation penalises commonly occurring words. However, in this case, this penalty applies to words found in the target document only. As before, the presence of a given query term in a document is less relevant the more frequent it occurs in that document. This is often given by taking the logarithm of frequency with which a word query term occurs in a document. This is simply because logarithmic growth is very slow. The TF transformation used for pivoted length normalisation replaces our naive count(w,d) with TF(w,d), given by the following equation,

$TF(w,d)=\ln\left(1+\ln\left(1+count(w,d)\right)\right)$

The most effective transformation that is commonly used is known as BM25 TF,

$TF(w,d)=\frac{(k+1)count(w,d)}{count(w,d)+k}$

for some parameter k>=0. Unlike a simple logarithmic function, TF(w,d) above is bounded above by k+1. This is important to prevent any one term from dominating query results: a single term cannot spam query responses.

## Document Length Normalisation

Finally, we want our similarity rankings to be able to take into account total document length. This is important to consider as a longer document is more likely to match a query - there’s simply more text that could match the query text. An effective approach is to use pivoted document length normalisation, which both penalises documents that are longer than the average document length, and rewards documents that are shorter. This variable DLN is given as,

$DLN=1-b+b\frac{|d|}{avdl}$

where, |d| is the current document length, avdl is the average document length in the collection, and b a parameter between the values of zero and one, inclusive.

## Wrap-up

Putting all of the above components together we get the following ranking function,

$f(q,d)=\sum_w count(w,q)\frac{\ln\left(1+\ln\left(1+count(w,d)\right)\right)}{\left(1-b+b\frac{|d|}{avdl}\right)}\log\frac{M+1}{df(w)}$

which is in fact the ranking function for the pivoted length normalisation ranking algorithm. Moving left to right we have,

• The term frequency for the query
• The term frequency for the document after applying the TF transformation
• The pivoted length normalisation
• The IDF

As as promised, we have seen step-by-step where all these components come from.

Obviously I have skipped a lot of detail for the sake of brevity, but there is still perhaps some lingering concern about some the components we’ve covered. Even though VSM gives a very robust and meaningful interpretation of terms in our corpus: they are vectors in a vector space, the other components seem to just be assumptions that just so happen to do well in application. This has been, and remains to be a problem for some researchers in this field, who continue to search for a more robust theoretical foundation of the heuristics given above. If this is you, then you’d be happy to hear that many of the features of VSM retrieval models emerge following the more mathematically robust approach taken by probabilistic retrieval models, which I’ll cover next time.

All the best,

Tom

## Managing Python Virtual Environments

Having gotten use to using Anaconda and Conda to create and manage Python projects, I recently found myself in the position of wanting the same degree of dependency management for different projects without the need of installing additional software. Luckily enough, Python 3.3+ supports virtual environments out of the box albeit in a very stripped-down forn. This short post is just to demonstrate how to use this neat tool - things that I have been consistently forgetting! Some of these steps are specific to the OS and shell that I use: macOS and ZSH. YMMV.

Assuming your version of Python supports virtual environments, you can create a new virtual environment in your terminal with,

which will create all the dependencies for the new virtual environment in the current working directory. You can also specify a path to your new virtual environment as well. For the purpose of development, it’s easier to activate a virtual environment to save you from consistently specifying the environment’s full path during development. To activate the virtual environment use,

Within the activated virtual environment, you can use pip as required to install particular dependencies. You can then save all your project dependencies to a text file,

and then use this to install the matching dependencies in another environment or project

Finally, leave your virtual environment with

All the best,

Tom

## TDDing a Simple KNN Classifier

In this blog I want to demonstrate how to implement a simple K Nearest Neighbour classifier from scratch using test-driven development (TDD). Although this is not something you would typically do day-to-day, it is a good way to fully understand how the classifier internals work albeit in a much more watered-down form. This blog takes a cue from the excellent Google Developers “Machine Learning Recipes” series, but goes approaches the classifier implementation using TDD as well as generalising the prediction algorithm.

## Scikit-Learn Implementation

Our starting place will be the Scikit-Learn (abbreviated to sklearn) implementation of the KNN classifier,

Going from top to bottom, we,

1. Load the iris data set from sklearn, this is a set of 150 entries with four features, each row corresponding to one of three labels
2. Assign the input and target features to variables X and y
3. Perform a test-train split on the data set
4. Initialise the KNN classifier
5. Pass in the training data. The training data forms the reference for our test data
6. Test the classifier with the test data
7. Print out the accuracy of the classifier

This is the basic workflow we want to retain, only we will call our own classifier rather than KNeighborsClassifier, while keeping the same API contract. The above code sample will also act as a performance benchmark, which we will return to later on.

Neighbours Benchmark Accuracy
1 90.7%
3 93.3%
5 96%
7 96%

## Some Theory

Before we get to coding anything else, let’s think about how the KNN algorithm works. Put simply, we want to predict the label for a given row in our dataset. For a given row in the test data, we compare it against each row in the training data. To do this we make the assumption that the label for our given row must be the same as the label of the nearest neigbouring row in the training dataset. This can be just the single nearest neigbour or the most frequently occuring label of several nearest neigbour as specified by the n_neighbours parameter. Nearest in this case is given by some measure of the difference between individual features in each row under consideration - the smaller this difference the closer the points are, the more likely they are to have the same label. Our distance measure will be the Euclidean distance, which is just a general form of Pythagoras’ theorem:

For this TDD approach we will be using pytest. To follow along create two files in the same directory, main.py and test.py. Running pytest test.py from the command line should run the test suite.

## Nearest-Neighbour Implementation

In our first attempt, we will be creating a KNN classifier that only considers a single nearest neighbour, the correct label is simply the label of the single row in the training data with smallest distance to the current row in the test data under consideration. To code this implementation, we will be using the outside-in approach to TDD: we define the API interface, and code the internals from there.

Following the example of the sklearn code sample above, we want to be able to initialise the classifier with a parameter n_neighbours, which should default to one if not provided. Writing out our first test,

Running this should obviously fail. In a new file main.py we want to create our new class with a constructor that satisfies the requirements of the test,

Then update the test file to import this file,

And we should have a green test.

We’ve now completed our first successful cycle of TDD, the rest is wash and repeat. Our next test should allow us to bootstrap our classifier by calling the fit() method. This allows us to pass in training data and training labels, X_train and y_train respectively. To test this, we need to specify some mock data corresponding to these values in addition to calling the classifier. The mock data could be anything though so long as it is truthy. For our purposes though, I’ve tried to mimick the structure of the iris data set of four features per input. The updated test file is below with the previous test case omitted for clarity.

The fix for this is very simple, just create a fit() method for our class,

We get to green once again. Now we come to actually predicting the label. Starting more simply this time, let’s write out a single test case following the API above,

In addition to the mock data previously defined, we have the test data X_test and y_test, which are the test inputs and the correct label respectively. The simple fix is to define predict() and return a hardcoded value

This will make the current test green, but won’t help if the test data changes, which is precisely what we’ll do

I’ll come to explain the precise arrangement later on, but now we have multiple contending rows. The algorithm for prediction is as follows, for a given X_test value,

• Initialise variables to keep track of the smallest distance and corresponding index for the row in the training data
• Then iterate through the training data
• For a given row in training data find the distance between that row and X_test
• If the newly calculated distance is smaller than current smallest distance then update both the smallest distance and the corresponding index
• Having gone through all the training data, return the label corresponding to the closest index

The final piece of the puzzle is how we will calculate the distance between two rows, which as we said above with be given by the Euclidean distance. Rather than do this from scratch we will import the distance module from the SciPy library. Putting this altogether we should get the following.

The final step to make is to consider the case where we have multiple rows in X_test - we need to call the above method for each row. I’m going to skip through the steps to do this, but the key word is ‘refactor’: I’ve moved the previous implementation into its own method __closest. predict now iteratively calls this new method and returns an array of predictions corresponding to each row in X_test. To TDD this, simply extend X_test and y_test.

This is the complete implementation of our KNN classifier where n_neigbors is implicity set to one. The reported accuracy is 90.7% - matching the benchmark perfectly.

## KNN Implementation

Now we want to generalise the classifier where n_neighbors is some odd number greater than or equal to one (odd to prevent ties). Having tests in place makes this process much easier as we can always backtrack if we break some functionality. Before going any further, we should add some test cases - we want to follow the philosophy of testing requirements not specific examples, or in other words a single test should run with a number of different parameters. Pytest terms this ‘parametrizing’ test functions:

The parameters used are completely arbitrary, but will form the basis of a comparison with the benchmark later on. Running this now, we should see three separate tests execute. The same change should also be made to the fit() test. Updating the predict() test is a little different, because we want to update this function’s internals as well. So far we have only considered taking the label of the single nearest row in the training data as the correct label - but what if we were to take the most frequently occuring label from a set of training data rows? We should see an increase in the reported accuracy.

The algorithm will be as follows,

1. For a given row in the test data, find all the distances from the rows in the training data following our distance measure
2. For this distances, find the k smallest values, where k is given by n_neighbors
3. For these k values, return the most common, this is the predicted label

Before we get to the implementation, we need to set up the tests. We need to parameterise both n_neighbors and y_test - the predicted label. Why’s that? Well first, think about what the new algoritm does: it returns the most commonly occuring value from a set of values. What might not be clear is what that means for us given that I want to keep X_train and y_train unchanged, which are repeated below.

If we take the k smallest distances between X_test and each of the rows in X_train, then we might find ourselves with a tie - something I don’t want to consider for the sake of simplicity. Considering odd values for n_neighbors only for X_test = [[0, 0, 0, 0]], then if n_neighbors = 1, our predicted label should be 0 as this is the label of the single row with the smallest distance from X_test. For n_neighbors = 3, we should instead get a predicted label of 1. That is because although row [0, 0, 0, 0] has the single smallest distance, of the three rows with the smallest distance [[0, 0, 0, 0], [1, 1, 1, 1], [1, 1, 1, 1]], two of the three have the label 1. Moving to n_neighbors = 5 is where things get difficult, because for our X_train the five rows with smallest distances will have labels [0, 1, 1, 2, 2] leading to a tie. This is precisely what I don’t want! So if I’m committed to keeping X_train and y_train constant the best thing to do is skip n_neighbors = 5 and instead consider n_neighbors = 7. That is simply because for the given X_train and y_Train above, I can get a single most common value from the labels [0, 1, 1, 2, 2, 2, 2].

Putting the above discussion altogether, for our X_train and y_train, the predicted label, y_test will vary for different values of n_neighbors. So our updated predict() test will look like,

Now we can break the tests! We don’t need to do anything directly to predict() as it is just the interface. To make this easier, I want to split __closest into two methods, __closest, which will now find and return a list of the distances betweet X_test and the rows in X_train, and __vote, which will gind the most common values out of the smallest n_neighbors distances.

There are few things going on the code sample worth highlighting,

• As we want to return the label, not the distance, I chose to create a tuple for each distance and the corresponding label in y_train. These are then appended to the list
• To make __vote simpler, we are passing a sorted list of tuples, sorting by the distance. There are many ways to do this, but I used a lambda function
• __vote should be fairly self-explanatory. To make things simpler, I’m using a Counter object to find the most common label from the list of labels

Using a list comprehension we can make this much cleaner,

The full code can be found here.

## Evaluation

Putting this altogether, and rerunning the first code sample only substituting the sklearn KNN classifier for our own we obtain,

Neighbours Our Accuracy Benchmark Accuracy
1 90.7% 90.7%
3 93.3% 93.3%
5 96% 96%
7 96% 96%

Which is great news: we are able to obtain the same accuracy with our simple implementation. However our implemenation is several orders of magnitude slower than sklearn’s. Interesting to note, but perhaps not suprising, our original implementation with n_neighbors hardcoded to one has a faster running time of around 80 ms.

Neighbours Our Running Time (ms) Benchmark Running Time (ms)
1 95.4 0.451
3 94.3 0.539
5 88.6 0.458
7 102 0.465

All the best,

Tom

## TensorFlow at 10,000 Feet

TensorFlow is an opensource library maintained by Google that has become a particularly popular choice for deep learning. Among the reasons for its popularity,

• Simple Python API,
• Resource management across multiple CPUs and GPUs, or even a remote server,
• A lot of other conveniences that make for quick prototyping and debugging.

In this post, I want to provide a very high-level overview of TensorFlow. We won’t get to code anything meaningful, but instead focus on some of the key concepts that make this library so confusing at first blush. To my mind at least, these are,

1. What is the TensorFlow workflow?
2. How to use constants, variables and placeholders?
3. How to share scope?
4. Session Management

Hopefully this post will go some way to explaining these points and more.

## 1. TensorFlow Workflow

Fundamental to understanding how to use the library is understanding the TensorFlow workflow, which breaks into two parts,

• Construction phase: where we declare the values and operators
• Execution phase: evaluate values and operators

Key to this distinction between construction and evaluation phases, is how TensorFlow thinks about your code internally. TensorFlow represents the code written in the construction phase as a dataflow graph abstraction. Without going into too much detail, this abstraction is how TensorFlow is able to execute operations in parallel across multiple CPUs and GPUs without additional difficulty to the programmer. This graph is composed of nodes and edges, where the nodes are operators and the edges are the values declared in the construction phase. All values are implicitly treated as tensors behind the scenes, and by tensor, we basically mean a nested Python list.

Aside: Bringing these concepts together we can kind of speculate the origin of the TensorFlow name, where “Tensor” refers to the basic unit of abstraction in the library and “Flow” being the flow of data through the data graph.

So up to this point, we’ve defined our values but can’t execute them - this is performed by a session. This is also where TensorFlow initialises our values defined in the construction phases as well as where it manages the hardware resources for us.

Sessions also underline the need to stress the distinction between the construction and execution phases: graphs are not stateful. That is, multiple sessions reusing the same graph will not share state. However, we can choose to checkpoint and restore our values, as we’ll see below. Before getting to the code samples, we need to introduce the different tensor types.

## 2. Constants, Variables and Placeholders

TensorFlow defines three main types of tensor for us: constants, placeholders, and variables. Constants are values that will not change, variables are values that can change, and placeholders are values that are assigned when we execute the graph. To make this a bit more concrete, we can relate these tensor types to a typical machine learning model. Constants for us will define hyperparameters values that are set once per model, variables will define values we want to change automatically during training like weights and biases, and placeholders are values we fed in from our dataset like our training labels.

Use of constants are the easiest to show. We simply assign constants to variable at the top, launch a session in a with statement, and evaluate the expression “a + b” using sess.run. We’ll be using a similar pattern in all the code samples in this post. By using a with statement to encapsulate our session, we ensure that session resources are freed up outside of the statement.

Moving onto placeholders, these are the access point for custom values into the session. To do this we use a feed to pass placeholders into the session. The setup is very similar to constants, but instead of setting a value we just declare the variable type.

Moving on to variables, the setup is as per constants but if we run this code sample as is,

We get an error. This tells us we need initialise the variable. We need to initialise variables in order to set the type and shape of the variable, which remains fixed for the graph. This is similar to how Python variables will raise a NameError if declared without a value. To fix this we call an initialiser like so. The easiest approach to take for us is to call the global initializer, which will initialise all variables at once. We can also initialise targeted variables as well as we will see in the final code sample of this blog.

## 3. Scope

Scope is important for precisely the reasons you’d expect: it allows us to namespace variables to ensure correct reuse. The code sample below gives a typical use case, the variable a is scoped to my_variables.

The 0 in the variable name isn’t anything to worry about - it is just because of how TensorFlow handles tensors behind the scenes.

## 4. Session Management

So far we’ve seen how we execute a graph within a session, but we can also use a session to save and restore variables, using the Saver class. The following demonstrates how we can checkpoint a session,

And we can subsequently load a previous session,

Session management is made use of a lot in the wild, where we can save variables after training a model, and later on load these variables before testing. Below is an example of initialising variables in combination with restoring from a previous session.

All the best,

Tom

## Pandas for Fun and Profit

Pandas is a powerful Python library for data analyis. Pandas is not the sole choice within the Python data analysis ecosystem, with a notable example being Numpy. However, Numpy is not very well suited for handling large tabular datasets of mixed types. Pandas itself is built on top of Numpy, so still has the conveniences of Numpy with added abilities. In this blog post, I want to review some of the key concepts that set Pandas apart from Numpy.

The mainstay of an Pandas program is a data structure known as a DataFrame. A DataFrame can be thought of as essentially a list of lists, but unlike a regular Numpy array a Pandas DataFrame can

1. Contain multiple data types,
2. Support non-numeric indices,
3. And handles missing values better.

The remainder of this post will look at each of these in turn.

For the purpose of illustration, we will be using a the “Daily Show Guests” dataset from the FiveThirtyEigth repo. We can create a DataFrame by reading in the CSV file,

DataFrames themselves are made up of another Pandas object known as Series objects, which is how Pandas represents the individual columns in a DataFrame. Series objects themselves are wrappers around Numpy arrays. Querying a DataFrame object by column name will return a Series object.

## 1. Multiple Data Types

One of the biggest differences from Numpy arrays is that Pandas Dataframes can contain different data types across different columns. Reading in the CSV file as above simply wouldn’t have been possible with Numpy. It is easy to verify the types by named column with the dataframe we created,

The above response might initially be confusing if you look at the CSV file, and highlights a subtlety we face when using Pandas. For instance, why do entries of the “Group” column have an object type? This is a consequence of Pandas dependency on Numpy: these object types are in fact Numpy objects and this is how Pandas represents strings of arbitrary length, rather than use Python strings.

As we will see in section 3, being able to support mutiple types has important consequences for how we can handle “missing values”.

## 2. Custom Indices

Given a DataFrame, we might want to use something other than the default integer index to make intuitive selections on the initial DataFrame. We do this using the set_index method,

We specifiy the drop=False attribute, as we don’t want to delete the “Show” column after setting our custom index.

Even after sepcifying a custom index on a DataFrame, we still retain the original integer index so can use any either as a means of filtering. Either way, we have three ways of filtering a DataFrame by rows or columns

• DataFrame.iloc - integer based indexing
• DataFrame.loc - label based indexing
• Python slice range

## 3. Handling Missing Values

Firstly, what exactly do we mean by a “missing value”? Put simply, a missing value is any value that is not present in the data. Seems easy, right? However, in the Python programming language a missing value can be represented by either a None object type value or nan float. This means that conceivably a missing value can be mapped to two distinct types, which may have all kinds of unintended consequences if we’re not careful. To take an example from this excellent blog post on the topic,

Instead of dealing with this messiness we can instead rely on Pandas to handle our missing values. Pandas treats None and nan interchangeably, and performs the necessary type casting behind the scenes, all we have to do is decide on some identifier for missing values. In the “Daily Show Guests” dataset, the string “NA” is used to mark missing values.

Pandas also gives us a set of methods we can call on our Series or DataFrame objects to deal with missing values. For example if we want to find rows with missing values we can call isnull() on the DataFrame, and then drop the rows with missing values.

Although this means we’re losing data, this might be acceptable if we’re just interested in high-level correlations. If we want to be more careful we could replace missing values with something else using the fillna() method.

All the best,

Tom

## Hosting External Events

My latest post is available to read on OpenTable Tech UK’s blog: “Hosting external events”.

All the best,

Tom

## Neural Networks from Scratch*: Presentation

Follow the link for slides to my talk Neural Networks from Scratch*.

All the best,

Tom

## Introduction

In the first post in this series, we created a deep learning neural network so now we have everything in place to actually solve the learning problem we originally introduced: how to create a neural network to correctly match the output of an XOR gate. To do this we need adjust our network weights in response to the output our network produces from the training data. This is what the backpropagation algorithm does. The backpropagation algorithm follows in order,

1. Forward propagation: Pass input into our network, and get some output
2. Error calculation: Find the error for this output using an error function
3. Backward propagation and gradient descent: Work out how this error relates to each weight in our network, and the update the weights in our network, which hopefully improves our output

And iteratively through these three steps until the network converges to the optimal set of weights for the learning problem. This post will take us through a single run of this algorithm.

## Forward Propagation

In order to get some output from our network we need to put some input into it, and move this through the network. Forward propagation is a process of finding the activation for a set of inputs and weights layer by layer. This is simply the dot product of the inputs with the weights,

activation = nodes • weights


For the hidden layer we have the additional step of plugging this result through the activation function,

activation = relu(nodes • weights)


For our network, taking a single hidden layer node, we have,

1. Input layer values A, B, just comes from the training data
2. Hidden layer node, H = relu([A, B] • [w_ih_A, w_ih_B])
3. Output node, O = H • w_ho = relu([A, B] • [w_ih_A, w_ih_B]^T) * w_ho

This is exactly the same process we would use to find out the value for O, our output value, for all nodes and weights. The code for this step is as follows,

As can be seen in the code sample, the process is run over an arbitrary number of iterations for the sake of illustration.

## Error Calculation

From the previous step, we got some output from the network - but how do we know if this is any good? We need to be able to find how closely this matches the training data output values, or in other words, find the error that our network output produces. After all, we want our to learn the training data and as we said in part 1, learning is just a process of error minimisation. To quantify the network error we are going to be using something called the squared error,

error = (output - target) ** 2


Our error will only ever be a value greater than or equal to zero, where the lower the value the better our network is doing. Think about it this way, if I shoot an arrow at a target, I don’t care if I’m above/below or to the left/right of the target - just by how much I missed.

You may still have the lingering question as to why we introduce this “extra” quantity of error, when we are really jsut worried about the network weights. It is simply that the error is very easy to track and see improvements in the network. Think of it this way, if the error increases we know the network is doing worse, and conversely if it decreases. If instead we just tracked the weights, we wouldn’t necessarily know what a given change in the weights meant for our network’s performance. Furthermore, as we’ll see in the next section an error function like the squared error is guaranteed to converge - at least for some choice of parameters. This implies that we can achieve the optimal set of weights for the learning problem using this approach. The

## Backward Propagation and Gradient Descent

At this point we have some output from our network and an error. So we know how well our network is doing, but how do we improve this error? It turns out that we can only change the weights. We have at our disposal the weights and the training data. However, we don’t want to change the training data as we want this to be an accurate reflection of the problem at hand. If we tweak the training data to get the network to train more efficiently we are only selling ourselves short: Once we move to the test case, the network would likely underperform as it not learnt data matching the learning problem. The takeaway is that we should only think about changing the weights during this process.

During this step, we work backwards through the network and determine how much to change each weight by finding how the error function changes with respect to each weight. The actual process of updating each weight is gradient descent’s responsibility. To illustrate how this is done, we will look at a single weight between each layer, see the network diagram below,

where as before we have placeholder variables for each node value as well as for the weights, where w_ho is the weight connecting the hidden layer node to the output not, and w_ih is the weight connecting the input node to the hidden layer node.

Firstly let’s determine how the error function changes in response to w_ho. We know what the error function looks like,

error = (output - target) ** 2


but we need to get this in terms of w_ho. To do this, recall that the output, O for a single node is given as the multiplication between the activation of the node in the previous layer with the weight connecting the two,

output = activation * weight


For this example the activation is given by H (we don’t care where this came from as we’ll justify in a bit), and the weight is w_ho, so plugging these into the above equation we get,

O = H * w_ho


With this expression for the output, O, we can plug this into our error function to get,

error = ((H * w_ho) - target) ** 2


giving us the error function in terms of w_ho. We can make a further simplification by reminding ourselves where H and target come from. The target value comes from our training data and so as discussed above is not something we are free to vary. The activation, H, is also a constant during this process as we are not feeding in new input values during this step of the backpropagation algorithm. Altogether that means that the only variable we have to play with is the weight, so the error function is simply the square of w_ho,

error ~ w_ho ** 2


This is just a parabola, which has a simple graph,

Indicated on the graph as a green point is the minimum of the error function, where the error is zero, which is the point we want to reach. The corresponding weight that minimises the error is the goal weight. To reach the minimum from an arbitrary point on the parabola, we need to ask ourselves: what is the change in the error for a given change in the weight value? The approach that gradient descent takes is greedy, it just takes the change in the weight that changes the error most. This is just a verbose way of saying we need to find the derivative of the error function with respect to a given weight to determine the weight update. Indicated below, the derivative is the slope of the line at the point given by the dashed lines.

In general, the derivative gives us the relationship between a set of variables, in our case that is how the change in the the weight, ∆weight, relates to the change in the error, ∆error, which is just the product of the derivative with the change in weight,

∆error = derivative * ∆weight


The derivative is calculated from the expression for the error function given above. I’ll just give it to you, but it is quite easy to find for yourselves,

derivative ~ H * (O - target)


We now know how much to update the weight, which is just the derivative, but do we add or subtract this amount from the current weight value? Going back the parabola that gives the relationship between the weight and the error we find that we add or subtract this update depending as to which side of the goal weight we find ourselves. If we are to the left of the goal weight, the derivative is always negative, as the graph is sloping downwards. However, in order to decrease the error, we need to increase the weight to move towards the goal weight. This means we should take the negative of the derivative as our weight update, so this is a positive number. The converse argument holds if we start to the right of the goal weight: The derivative is positive, but we need to decrease the weight to decrease the error. Both scenarios are illustrated below.

Putting this all together, the weight update for weights connecting the hidden layer nodes and output nodes, like w_ho, is given by,

If we were to repeat this process for many iterations, we simply continue until we converge to this goal weight. A consequence of the error function we’ve chosen for our network, a parabola, means that the goal weight is in fact a global minimum, meaning that as long as we can guarantee convergence we will achieve the optimal weight for the learning problem. This is something we’ll return to later.

Now that we know how to update the weights between the hidden and output layer, we can use the same process to update the weights connecting the input and hidden layer. Going back to the network diagram,

We need to find the change in the error function with respect to w_ih. This is the same process as before but complicated by the activation function. We need to express the error function as a function of w_ih. To do this we need to first express the output O in terms of H and w_ho, which as before is just the product,

O = H * w_ho


This time around however, H is not fixed for us as it comes from w_ih which we are varying. This means we need to express H in terms of A and w_ih. H is a hidden layer node, so the activation H is given as,

H = relu(A * w_ih)


now plugging this into our previous equation for O,

O = relu(A * w_ih) * w_ho


and finally we plug this into the error function to express the error function in terms of w_ih,

error = (relu(A * w_ih) * w_ho - target) ** 2


Having expressed the error function in terms of w_ih we need to find the derivative of the error function with respect to w_ih, which will give us the size of the weight update. This is more complicated than before as we need to take the derivative of the activation function, relu, into account as well. As before, I will skip over the details and just give the result,

derivative ~ (O - target) * reluDeriv(A * w_ih) * A * w_ho


This expression looks pretty dense, especially as we have factors of w_ho and reluDeriv which we haven’t seen in the previous weight update. There is however a fairly intuitive explanation for their presence, which relates to the key aspect of the backpropagation process: we are going back through the network and attributing the final error to each of the weights. To illustrate this point, take for example the limit of w_ho, that is as it approaches zero. If w_ho were zero, then that would imply that the activation H, and any weight that produced H, couldn’t have contributed to the final error - so we shouldn’t update w_ih in this iteration. A similar reasoning applies to reluDeriv. In part 1, we saw that relu produced an activation equal to or greater than zero, which means either the activation did or didn’t contribute to the final error. reluDeriv captures this intuition, which is given by the following function,

This returns 1 if and only if the A * w_ih is greater than zero, otherwise it returns 0, where a returned value of 1 means the activation did contribute to the final error and 0 if it didn’t.

Altogether, these derivatives give us all the weight updates we need for our network. But do these converge to a set of goal weights? Above, I’ve teased that this should be possible, but there is a chance that convergence may not occur. This is because all the weight updates we have scale with the activation value, meaning that there is a risk of divergence if the activation grows too large (these are not normalised). For instance, the derivative the error function with respect to w_ho was found to be,

derivative ~ activation * (output - target)


The diagram below illustrates this point. Iteration-by-iteration the network error will increase as our weight moves increasingly further from the minimum, alternating either side of the goal weight.

The fix for this is quite simple: we just introduce a new multiplicative factor to scale the derivative. This factor is termed the “learning rate” and usually referred to as “alpha”. Adding this to the derivative just given with respect to w_ho we have,

weight -= alpha * activation * (output - target)


There are some important considerations to make when choosing alpha, as it has a direct impact on the rate of convergence of the weights - this is by no means a trivial problem in general. However, for our purposes it will be ok to just choose a fixed number between 0 and 1. Putting this all together, our code for backpropagation and gradient descent is,

## Recap

We’ve covered a lot of material in these two blog posts. Taking it from the top, we chose a learning problem, the XOR logic gate, and matched it to a deep learning neural network. We were able to justify our choices for the network such as the error function and other parameters. Having done all this we worked through a single iteration of the backpropagation algorithm to demonstrate how we would update our network weights by adjusting the weights. Hopefully, this goes someway to demonstrate how deep learning neural networks work from the ground up, and how the choice of learning problem influences the choices we make.

Coming to the end of this series we have a complete network, which will be able to train correctly for the XOR learning problem. The full code sample is available as a gist. If you have any other questions please get in touch.

## References

My main resources for these blog post were “Grokking Deep Learning” by Andrew Trask, “Machine Learning Study Group” notes by Nikolay Manchev, and “Neural Networks and Deep Learning” by Michael Nielsen.

All the best,

Tom

## Introduction

This blog post and the next are to complement a talk I will give at Infiniteconf. In this series of posts, we will be investigating a learning problem: determining the correct output of an XOR logic gate. To do this, we will build a simple feed-forward deep learning neural network, which will use the backpropagation algorithm to correctly learn from the data provided for this problem. In this first post we will be covering all the details required to set up our deep learning neural network. In the next post, we will then learn how to implement backpropagation, and how the this learning algorithm is used to produce the output matching an XOR gate as closely as possible. To save us some effort moving forward, we will be using numjs to handle numeric operations we need to perform, hence the asterisk in the title.

Before we get to any detail, it’s probably worth reflecting on why deep learning has become such a hot topic in the industry. In a nutshell, deep learning has been shown to outperform more traditional machine learning approaches in a range of applications. In particular, very generic neural networks can be very powerful when it comes to complicated problems such as computer vision: a deep neural network with no specific instruction of its target domain can classify images effectively.

Coming back to the learning problem at a very high level, ultimately we are simply trying to get the correct output for a given input. We are going to follow an approach that falls into the machine learning paradigm known as supervised learning. Initially, we train our network by giving it a set of training data with known inputs and outputs. During this process the network is expected to updated itself so it most closely produces the expected output for the given input - we will cover this in detail in the next blog post. Once we are satisfied with the network’s training progress, we move to a test case. This is where we provide the network with inputs, without telling it upfront what the output should be. Ideally it should be able to match the expected output very closely, provided it has been adequately trained.

I’ve used the word “learning” a few times already, but what do we mean exactly in this context? Simply put, we can think of “learning” as being the process of minimising the network error as it works through the training set. To accomplish this, the network will need to make some adjustments in response to an error calculation which we will cover in part 2. This can equivalently be thought of as a process of finding the correlation or relationship between the training data inputs.

## Setup

We better start building our network. Let’s begin by reviewing the example problem we are going to cover, the XOR logic gate. The possible inputs and outputs for XOR can be exhaustively given by the logic table below,

A B A XOR B
0 0 0
0 1 1
1 0 1
1 1 0

The XOR gate tells us that our output will only ever be 1 if our inputs are different. This is precisely why XOR has the longform name “exclusive or”, we get a output of 1 if either A or B are 1 but not both.

Given the logic table above, we can say that our input with be an array of length 2, with entries given by A and B respectively. Our output will simply be the integer either 1 or 0. Our training data follows from the logic table, where nj is our import of numjs,

What should our network look like to reflect these inputs and outputs? The building block of a neural networks are nodes, these contain activation values that is values our network produces at an arbitrary point in the network. These are basically the “neurons” of our network. The nodes are collected together in layers, such that each layer has common inputs and outputs. Given this information, we will have two layers, an input and an output layer, with the former made of two nodes, and the later just one. Putting this altogether, our network will look something like this,

Where the two input nodes on the left hand side have placeholder values A and B, and our output node has a value O. I’ve also snuck in some numeral values hovering around some arrows connecting our nodes. These are “weights” and they indicate the strength of the relationship between any two nodes - where do they come from? The answer is we randomly pick them. For more sophisticated approaches, you might want to take into account the number of inputs and the underlying problem when selecting weights. We will instead use the simpler approach of choosing from a normal distribution with mean 0 and standard deviation 1. We will also add in another restriction to our weights: they must be between -1 and +1. For our network of two input nodes and one input node, our weights can be generated as follows,

Moving forward, we will often make strange, seemingly arbitrary, choices amongst our parameters. - this is always for the same two reasons. First, we want to keep things as simple as we can for the sake of illustrative purposes. Secondly, we want to make choices to ensure our network learns as quickly as possible, or at least that we don’t make the learning process more difficult than it needs to be.

Together the node values and weights allow us to calculate the value of other nodes in the network, known as the activation. The values of A and B come from our training data, and given our weights we can find the value of O by taking the weighted sum of A and B. The network above would give the value for O in terms of A and B as,

O = 0.4 * A + 0.6 * B


This is true of an arbitrary number of weights and inputs. Instead of working out this sum for each possible output node, we can instead use the dot product which does exactly this sum but across a whole layer at once. Using the dot product, the above equation would become

O = [A, B] • [0.4, 0.6]


We will not do this explicitly in the code sample and instead rely on the implementation provided bynumjs.

At this point we have a working neural network! However, this will not be able to correctly match the XOR output. To do this, we will need to create a deep learning neural network. To walk through the reasons why, I will first introduce a logic gate that could be solved by our present network. The example being the AND gate, with the logic table below,

A B A AND B
0 0 0
0 1 0
1 0 0
1 1 1

This logic gate can be correctly mapped to what is known as a linear neural network. The term “linear” used in relation to neural networks refers to both a condition relating to both the input nodes and the expected output, and in either case have important implications for our network. To demonstrate, I’m going to give the AND gate logic table a different representation,

This graph gives the logic table where the inputs correspond to horizontal and vertical axes, and the output appears in red. We can read, for example, the output corresponding to A = B = 1 as being 1 as expected. Given how the output appears in the graph, we can separate the two kinds of output, 0 or 1, with a single linear dividing boundary, given below

Output that can be divided like this are called “linearly separable”. This is a condition for a problem to be solvable by a linear neural network. Thinking in terms of the inputs instead, given that a linear neural network can produce the correct output implies that exists a correlation between the input nodes that our current neural network correctly captures. It not really important to go into this in any more detail, but bearing this result in mind we can return to the XOR gate. Unlike the AND gate, the XOR gate cannot be solved by a linear neural network precisely because the output is not linearly separable. This point is better illustrated by a graph like the one given for the AND gate,

You should be able to eyeball it: we cannot find a single linear dividing boundary between the 0 and 1 outputs.

Moving on from this result, we need to add something else to our network to make it work for an XOR gate: we need to introduce nonlinear behaviour. We need to start by adding other layers of nodes between the input and output layers. This is known as a hidden layer, and is used to create intermediate correlations from our input nodes. In this context, correlation simply means some combination of nodes that produces a new activation. As with other choices we’ve made for our network so far, we are going to go with the simplest choice possible. In this case, we’ll add a single hidden layer with three nodes. Our network now looks like this,

Looks great right? But this is still a linear network. To illustrate, let’s focus on one hidden layer node and the two input nodes and output node it connects to like so,

This should look familiar, with the new hidden layer node given a placeholder value H. To find the value of O in terms of A and B we first have to find the value of H. This is simply the weighted sum of A and B,

H = 0.4 * A + 0.6 * B


O is just this expression multiplied by the weight connecting H and O,

O = 0.5 * H = 0.5 * (0.4 * A + 0.6 * B)


And expanding this out,

O = 0.2 * A + 0.3 * B


This looks exactly like the output we’d expect from our original linear network, which means the hidden layer node is not adding anything new to our network. In fact our hidden layer as a whole is not giving us any new correlations, so we might as well not have it in this case, and have a network with slightly tweak weights as below,

To move away from a linear network we need to add yet another component: activation functions. This is precisely so that the hidden layer activations is not just some weighted of input nodes, so our network is not simply reducible to a single input and output layer. Activation functions are just mathematical functions that determine whether a hidden layer node turns “on” or “off” for a given input. They fit within the common conceptual picture of neurons in the brain. Like our hidden layer nodes, neurons receive an electrical signal and may signal on or off if this input reaches a certain threshold. The choice of activation ultimately comes down to the kind of problem we are dealing with. The activation function we will be using is known as the “relu” function, which follows the rules:

• If the input value < 0 then return 0
• Otherwise return the original input

Turning to code we get the following,

This has been refactored to use a more functional style, but the basic logic is this same. x => ((x > 0) * x) is the actual relu function as discussed, and the iterator helper function applies this element-by-element on an array of arbitrary size.

That’s a lot of material to cover already, but we now have a deep learning neural network! In the next post we will see how to run input into this network to get output and how to get the network to learn the training set data adequately.

All the best,

Tom