Top 5 Mistakes Companies Make with Data Science

My latest post is available to read at my Medium blog: “Top 5 Mistakes Companies Make with Data Science”. What to avoid in your journey towards data-driven decision-making.

All the best,

Tom

Data Science Succes for Start-ups

My latest post is available to read at my Medium blog: “Data Science Success for Start-ups”. It’s about how to successfully plan for data science projects.

All the best,

Tom

Normalisation Techniques

Introduction

Feature normalisation is a key step in preprocessing data, either before or as part of the training process. In this blog post, I want to discuss the motivation behind normalisation and summaries some of the key techniques used. Most of the material from the blog is based on my reading of Google’s excellent “Data Preparation and Feature Engineering in ML” course.

This particular blog post fits into the wider topic of feature engineering or data transformation as a whole, which relates to other techniques such as bucketing and embedding as well. For the purpose of this blog post I will be conflating techniques which may otherwise be considered separately as standardisation and normalisation proper. The typical distinction between the two is detailed here.

The techniques to be considered are,

  • Scaling (to a range)
  • Feature clipping
  • Log scaling
  • Z-score

In all cases, it is important to visualise your data and explore the summary statistics to ensure the transformation applied is appropriate for the dataset considered.

Why Transform your Data?

In this blog post, we will only consider transformation of numerical features. In this case, the motivation behind normalisation is to ensure the values of a given feature are on a comparable scale. This is to ensure the data quality, and relatedly, better model performance and an accelerated training process. This may be required even in the case of certain gradient optimisers, which can handle the unnormalised data across different features, cannot necessarily handle a wide range of values for a single feature. The data transformation itself can either happen before training or within the model itself. The main tradeoff being that the former is performed in batch whereas the latter is performed per iteration. Deciding between these two approaches will also depend as to whether the model lives online or offline.

Scaling

This is simply mapping from the given numeric range of a feature to a standard range, typically between zero and one, inclusive. Achieve this by transforming using min-max scaling,

\[x'=\frac{x-x_{min}}{x_{max}-x_{min}}.\]

This transformation is particularly appropriate if the upper and lower bounds of the data are known, with few or no outliers, and the data is uniformly distributed.

Feature Clipping

Set all feature above or below a certain threshold to a chosen fixed value. This threshold (or thresholds) is an arbitrary number, in some cases it is taken to be a multiple of the standard deviation. May apply feature clipping before or after other normalisations, which is useful in the case of other transformations that assume there to be few outliers.

Log Scaling

This is appropriate when the distribution of datapoints follow a power law distribution. This transformation aids in applying linear modeling to the dataset. The base of the log is, generally speaking, not that important to the overall transformation.

Z-Score

Transform feature value in terms of the number of standard deviations away from the mean. The transformed distribution will have a mean of zero and standard deviation of one. It is desirable, but not necessary, that the feature values contain few outliers. The equation is as follows,

\[x'=\frac{x-\mu}{\sigma}.\]

All the best,

Tom

Hierarchical/Multi-level Indexing in Pandas

Introduction

A key stage in any data analysis procedure is to split the initial dataset into more meaningful groups, which can be achieved in Pandas using the DataFrame groupby() method. It can be more useful still to manipulate the returned DataFrame into more meaningful groups using hierarchical (interchangeably multi-level) indices. In this blog, we will go through Pandas DataFrames indices in general, how we can customise them, and typical instances when we might use hierarchical indices. The data used in the code examples in this article come from UK population estimates provided by the ONS.

DataFrame Indices

First things first, let’s get the initial dataset we’ll use for this article.

# Get initial DataFrame
url = 'https://www.nomisweb.co.uk/api/v01/dataset/NM_31_1.jsonstat.json'
data = jsonstat.from_url(url)
df = data.to_data_frame('geography')
df.reset_index(inplace=True)

The above will return a DataFrame containing a chronological record of population counts by UK region and year, as well as some basic demographic information such as sex and age group.

In general, whenever a new Pandas DataFrame is created, using for example the DataFrame constructor or reading from a file, a numerical index of integers is also created starting at zero and increasing in increments of one. By default, this is a RangeIndex object, as you can confirm by looking up the DataFrame’s index attribute,

df.index
#=> RangeIndex(start=0, stop=22800, step=1)

Pandas also supports setting custom indices from any number of the existing columns using the set_index() method. By specifying the keyword argument drop=False we make sure to retain the column after setting our custom index. Even after specifying a custom index on a DataFrame, we still retain the original integer index so can use either as a means of filtering or selecting from the DataFrame. DataFrame.iloc is used for integer-based indexing, whereas DataFrame.loc is used for label-based indexing. Ideally an index should be a unique and meaningful identifier for each row. This is precisely why we may choose to use multi-level indexing in the first place. Given a custom index, we can revert to the standard numerical index with DataFrame.reset_index(). For our purposes, this doesn’t make too much sense, but imagine having a collection of measurements for a set of unique datetimes: We could choose the datetime as our index.

Groupby & Hierarchical Indices

A more typical scenario where we would come across hierarchical indices is in the case of using the DataFrame.groupby function. Given our DataFrame above, imagine we wanted to find the breakdown of the latest population statistics by region and sex,

# This line is to find the population count by sex for the most recently recorded year i.e. df.data.max()
df = df[(df.date == df.date.max()) & (df.age == 'All ages') & (df.measures == 'Value')][['geography', 'sex','Value']]

# This is just to standardise the column names
df.columns.values[0] = 'region'
df.columns.values[-1] = 'value'

Calling groupby() on this DataFrame will allow us to group by the desired categorise for our analysis, which in this case will be the region and sex. We want to find the sum total populations conditioning for region and sex,

df_grouped  = df.groupby(['region', 'sex']).sum()

# | region            | sex    | value      |
# | ----------------- | ------ | ---------- |
# | England and Wales | Female | 29900600.0 |
# |                   | Male   | 29215300.0 |
# |                   | Total  | 59115800.0 |
# | Northern Ireland  | Female | 955400.0   |
# |                   | Male   | 926200.0   |
# |                   | Total  | 1881600.0  |
# | Scotland          | Female | 2789300.0  |
# |                   | Male   | 2648800.0  |
# |                   | Total  | 5438100.0  |
# | Wales             | Female | 1591300.0  |
# |                   | Male   | 1547300.0  |
# |                   | Total  | 3138600.0  |

The index of the DataFrame df_grouped will be a hierarchical index, with each “region” index containing multiple “sex” indices. We can confirm this by looking up the index attribute of df_grouped,

df_grouped.index

# => MultiIndex(levels=[['England and Wales', 'Northern Ireland', 'Scotland', 'Wales'], ['Female', 'Male', 'Total']], codes=[[0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3], [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2]],    names=['region', 'sex'])

We may instead want to swap the levels of the hierarchical index so that each sex index contains multiple region indices. To do this, call swaplevel() on the DataFrame.

df_grouped.swaplevel().index

# => MultiIndex(levels=[['Female', 'Male', 'Total'], ['England and Wales', 'Northern Ireland', 'Scotland', 'Wales']],
           codes=[[0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2], [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3]],
           names=['sex', 'region'])

For presentational purposes, it is useful to pivot one of the hierarchical indices. Pandas unstack() method pivots a level of the hierarchical indices, to produce a DataFrame with a new level of columns labels corresponding to the pivoted index labels. By default, unstack() pivots on the innermost index.

df_grouped.unstack('sex')

#                     | value                                |
# | sex               | Female     | Male       | Total      |
# | region            | ---------- | ---------- | ---------- |
# | England and Wales | 29900600.0 | 29215300.0 | 59115800.0 |
# | Northern Ireland  | 955400.0   | 926200.0   | 1881600.0  |
# | Scotland          | 2789300.0  | 2648800.0  | 5438100.0  |
# | Wales             | 1591300.0  | 1547300.0  | 3138600.0  |

The resulting DataFrame of this “unstacking” will no longer have a hierarchical index,

df_grouped.unstack('sex').index

#=> Index(['England and Wales', 'Northern Ireland', 'Scotland', 'Wales'], dtype='object', name='region')

All the best,

Tom

Investigating Attendee Reviews

My latest post is available to read on Skills Matter’s Medium blog: “Investigating Attendee Reviews”. It’s a look at Skills Matter’s reviews app.

All the best,

Tom

Shining a Light on Black Box Models with LIME

Machine learning models are increasingly the primary means with which we both interact with our data and draw conclusions. However, these models are typically highly complex and difficult to debug. This characteristic of machine learning models leads to them frequently referred to as “black boxes” as there is often very little transparency - or at least very little upfront - of how the input data links with the output the model produces. This is much more of a problem for more sophisticated models, as it is generally accepted that more sophisticated models are equally more intractable.

The LIME project aims to address this issue. I have only recently been introduced to LIME by the course Data Science For Business by Matt Dancho, but it has definitely piqued my interests and opened up an entire area of research that I was previously unaware of. In this post, I want to go through the main motivating factors of the project and review the theory of how it works. Although the project has evolved somewhat from its inception, as I will not be going into code at this stage, the general discussion of the post should still be valid. This is a very interesting topic to me and I will want to return to it in the future, so this post should provide a strong foundation for future blog posts I will write on this topic. The best place to find out more about this is one of the original papers “Why Should I Trust You?”: Explaining the Predictions of Any Classifier, which I’m sure we can agree is one the best named academic papers out there.

Why Clarity is Important

The basic way of interacting with a model is that we either directly or indirectly provide some input data and obtain some output. Especially in the case of the non-specialist, they may have no real understanding of either the link between the input data given to a model and the output it provides, and how the one model compares to another. As a data scientist, you may be able to draw some comparison between models during the test phase of model development such as by using an accuracy metric. However such metrics have their own problems: In general they do not capture the actual metrics of interests we want to optimise for i.e. actual business metrics, and do not, on their own at least, indicate why a model’s output may be less suitable than another, for instance a better performing model may more complicated to debug.

Taken together, the above points relate to issues of the trust placed upon the model. These trust issues relate to two key areas: (1) Can I trust the individual predictions made by the model? (2) Can I trust the behaviour of the model in production? To take this a bit further still, in a world increasingly aware of the practical implications of machine learning models, and especially so now that GDPR has come into action, we can no longer deny the ethical questions surrounding machine learning. It behoves use to understand how the model operates internally. A key step to resolving these issues is to grant a more intuitive understanding of the models behaviour.

As detailed in the original paper, experiments show how human subjects can successfully use the LIME library to choose better performing models and even go on to improve their performance. The key principle to this is how LIME generates local “explanations” of the model, which characterise sub-regions of the model’s original parameter space.

Unpacking “LIME”

The name LIME stands for, Local, Interpretable, Model-Agnostic, Explanations and can really be understood as the mission-statement of the package: The LIME algorithm wants to produce explanations of the predictions of any model (hence model-agnostic). By explanation, we mean something that provides a qualitative understanding between an instance’s components and the model’s predictions. A common way to do this in LIME is to use a bar chart to indicate the individual model components and the the degree to which they support or contradict the predicted class.

Localness has another quality associated with it: the local explanations must have fidelity to the predictions obtained from the original model. That is, the explanation should match the prediction of the original model in that local region as closely as it can. Unfortunately, this brings it into conflict with the need that these explanations be interpretable, that is, representations that are understandable to humans regardless of the underlying features. For instance, imagine a text classification task with a binary vector as output, regardless of the number of input features. By feeding more features into the model of the local explanation, we could anticipate greater accuracy with respect to the original model. However, this would add greater complexity to the final explanation as there will be that many more features contributing to the explanation.

In other words, LIME aims to programmatically find local and understandable approximations to a model, which are as faithful to the original model as possible. Such simplified models around single observations are generally referred to as “surrogate” models in the literature: surrogate, meaning simple models created to explain a more complex model.

The gold-standard for explainable models is something which is linear and monotonic. In turn these mean,

  • Linear: a model where the expected output is just a weighted sum of inputs, possibly with a constant additive term,
\[f(x) = \sum_{i\in n} x_i + c.\]
  • Monotonic: for instance in the case of a monotonically increasing function, the output only increases with increasing input values,
\[f(x_j) > f(x_i) \iff x_j > x_i.\]

This is precisely what LIME tries to do by finding linear models to model local behaviour of the target model.

In order to find these local explanations, LIME proceeds using the following algorithm,

  1. Given an observation, permute it to create replicated feature data with slight value modifications. (This replica set is a set of instances sampled following a uniform distribution)
  2. Compute similarity distance measure between original observation and permuted observations.
  3. Apply selected machine learning model to predict outcomes of permuted data.
  4. Select m number of features to best describe predicted outcomes.
  5. Fit a simple model to the permuted data, explaining the complex model outcome with m features from the permuted data weighted by its similarity to the original observation .
  6. Use the resulting feature weights to explain local behaviour.

(The above steps are taken from the article “Visualizing ML Models with LIME” by “UC Business Analytics R Programming Guide”.)

The main benefit of this approach is its robustness: local explanations will still be locally faithful even for globally nonlinear models. The output of this algorithm is what is referred to as an “explanation” matrix, which has rows equal to the number of instances sampled, and columns for each feature. At this stage, the matrix produced is sufficient to provide local explanations in whatever form is deemed appropriate. In the next step, this matrix is used to characterise the global behaviour of a model.

Going Global

What about global understanding of the model? To achieve this, LIME picks a set of non-redundant instances derived from the instances sampled in the previous step, following an algorithm termed “submodular pick” or SP-LIME for short.

Once we have the explanation matrix from the previous step above, we need to derive the feature importance vector. The components of this vector give the global importance of the each of the features from the explanation matrix. The precise function mapping the explanation matrix to the importance vector depends on the the model under investigation, but in all cases should return higher values (meaning greater importance) for features that explain more instances i.e. features found across more instances globally.

Finally, SP-LIME only wants to find the minimal set of instances, such that there is no redundancy in the final returned set of instances. Non-redundant, meaning that the set of local explanations found should cover the maximum number of model features with little, if any, overlap amongst the features each individual local explanation relates to. The minimal set is chosen by a greedy approach that must satisfy a constraint that relates to the number of instances a human subject is willing to review.

In short, the approach taken by LIME is to provide a sufficient number of local explanations to explain the distinct behavioural regions of the model. This sacrifices the global fidelity of explanations in favour of local fidelity. As discussed in the original paper, this leads to better results in testing with both simulated and human users. In general though, while global interpretability can be approximate or based on average values, local interpretability can be more accurate than global explanations.

Closing Remarks

I want to return to LIME and the wider topic of machine learning interpretability in future blog posts, including how this works with H2O as well as being able to provide a more in-depth technical run-through of the library.

All the best,

Tom

The Why of Data-Driven Organisations

Something a bit different! My latest blog post is from the transcript for my lightning talk I will deliver at Infiniteconf 2018 in July. This is available to read now on medium. See you there!

All the best,

Tom

Data-Driven Decision-Making and CRISP-DM

A key driver of the rise of the so-called data-driven organisation is the increased awareness and use of data-driven decision-making (DDD) at all levels of the business. This is in a large part because of the increasing availability and quality of collected data, as well as increasing opportunities to make use of it. In this blog post I would like to discuss what DDD is, how DDD relates to data mining, and finally how to approach data mining projects. Much of the material for this post comes from my reading of “Data Science for Business”, by Foster Provost and Tom Fawcett.

Data-Driven Decision-Making

DDD refers to the use of data analysis to drive meaningful decision making. For example, a supermarket may be able to identify triggers that lead to changes in purchasing decisions and manage stock accordingly. Businesses following DDD principles have been found to demonstrate statistically significant improvements in their productivity (See “Data Science for Business” and references therein). However, that is not to say use of DDD should entirely preclude the use of intuition to help inform business decision: rather DDD should be another component to decision processes.

DDD itself relates not just to the use of data science and data mining techniques in isolation, but also to its automation, sometimes referred to specifically as “automated DDD”. It is with automated DDD that practices typical of data science or data mining really come into their own as distinct from other analytical techniques such as statistics, database queries etc. This is precisely because data mining allows a business to automate the search for knowledge and pattern recognition, although in reality though, there may always be an unavoidable manual aspect to this knowledge discovery process.

Previously I referred to “data science” and “data mining” separately. The distinction between the two concepts is always a bit unclear, however this separation is often useful to be able to refer to specific subtasks comprising the data mining cycle (see CRISP-DM below) as separate from the broader field of data science.

CRISP-DM

Having established the desirability of DDD, how do we achieve it? This is where we need to begin by first identifying the business problems we are facing or want to investigate, and see how to approach these using data mining.

Despite the seeming variety of business problems, they largely fit into one of a number of well known data mining tasks, such as classification, regression, and clustering. It is of course quite a skill to correctly identify precisely the kind of problems being addressed so as to decide early on what kind of approach to follow. This process of identification largely follows using a combination of understanding the kind of business problem being addressed and what data is available. Once the kind of data mining task is established, it becomes possible to approach it in a systematic way.

The high-level approach to engaging with business problems with data mining techniques is fairly well established, and is formalised by the framework known as the “Cross Industry Standard Process for Data Mining” or CRISP-DM. The following process diagram gives the relationship between the different phases of CRISP-DM. Diagram by Kenneth Jensen, distributed under a CC BY-SA 3.0 license

There is a lot of detail that can be adeed to this, but to highlight the main features,

  • CRISP-DM is cyclical in its very nature, given by the large circle bounding the diagram. This indicates an iterative approach to data projects
  • The starting point for any iteration should be with “business understanding” step
  • There are also multiple cycles within the overall CRISP-DM cycle, such as that between the tasks of “business understanding” and “data understanding”
  • A complete cycle from “deployment” to “business understanding” will usually follow from the successful completion of a project. This can follow from as new insights are generated by a previously successful model

The CRISP-DM diagram however does not perhaps do a good job of capturing the exploratory nature of data mining projects. This is much more so the case with data mining projects than typical software development projects due to the greater degree of inherent uncertainty e.g. overall project expectations may change from subtask to subtask. This may require a greater reliance on prototyping as opposed to iterative releases.

All the best,

Tom

Word Embeddings with Word2vec

If you’ve read any of my previous blog posts on information retrieval models, you should have come across a reference to a “bag-of-words” model. That is where we just consider the frequency of terms in a given document as the starting point for our more elaborate models. In many cases this is perfectly acceptable simplification and can lead the very powerful applications such as the retrieval models previously discussed. But what are we losing? We no longer no the have any sense of the structure of the original document and in particular the relationship between the words.

In this blog post we will look at one of the most popular and well-known word embedding algorithms, Word2vec. Using a word embedding such as that created by the Word2vec algorithm we can learn the semantic and syntactic relationship between words in our document or set of documents (termed “corpus”). Where I would explain these terms as,

  • Semantic: relating the meaning of the word
  • Syntactic: relating to the spelling/structure of a word e.g. plural vs singular

We will cover the fundamental concepts behind it, how it works, and some competing algorithms.

The reading for this blog post came from a combination of “Natural Language Processing in Action” and the original research papers by Tomas Mikolov et al.

Fundamentals

The Word2vec model continues an established tradition of using neural networks to establish a language model. Word2vec itself is a fairly simple recurrent neural network made of,

  • Input layer - words in corpus are passed in as one-hot encoded vectors
  • Recurrent hidden layer - this uses a sigmoid activation function, with a number of neurons equal to the number of dimensions used to encode all the words of the vocabulary
  • Output layer - this uses a sigmoid function to get a normalised probability vector from the activations of the hidden layer

The actual output value for a given run is taken to be word corresponding to the highest value in the probability vector. Once we have trained the network (see below) we can ignore the output of the network as we only care about the weights: these are what form the embeddings. In general, provided our corpus is not so specialised, we can use a pre-trained word embedding and avoid performing this step ourselves.

A big part of Word2vec is the ability to process the relationship between words, as learned by the word-embedding, using simple linear algebra - so-called “vector oriented reasoning”. Applicable to both the semantic and syntactic relationship between words, this proceeds by way of analogy,

  • Semantic: vec(king) - vec(man) + vec(woman) = vec(queen) i.e. “woman” is to “queen”, as “man” is to “king”
  • Syntactic: vec(apple) - vec(apples) ~= vec(banana) - vec(bananas) i.e. relate singular and plural forms

Where e.g. vec(king) is the word vector embedding given by dot product between input word vectors and weights matrix.

Approaches

How does Word2vec actually learn the relationships between words? This algorithm takes it that these relationships emerge from the cooccurrence of words. There are two main approaches to determine this cooccurrence: the skip-gram approach and the continuous bag of words (CBOW) approach.

Take the following sentence,

“The quick brown fox jumps over the lazy dog”

To generate training data we imagine having a small window that move across the document words. For instance, if this window is of five words width, we only consider samples of five words at a time (in the order they appear in the original document).

In the skip-gram approach, we are trying to predict the surrounding four words for a given input word - the word in the middle of our window. “Skip-gram” because we a creating n-grams that skip over words in the document e.g. want to find relationship between “The” for input “brown” ignoring “quick”. The table below demonstrates what this would look like for the example sentence above. In the table headings used, w_t refers to the input word, and e.g. w_t-2 is the word two places before the input.

Input Word w_t Expected Output w_t-2 Expected Output w_t-1 Expected Output w_t+1 Expected Output w_t+2
The     quick brown
quick   The brown fox
brown The quick fox jumps

The skip-gram approach can be viewed as a kind of “flipped” version of CBOW approach, and vice versa. Instead of trying the predict the surrounding words for a given input word, we are trying to find the target word for the set of surrounding words. This approach is termed “continuous” bag of words, as we can imagine finding a new bag of words for a given target word as we slide the window along our document. The table below demonstrates what this would look like for the example sentence above, with the same notation as the skip-gram example.

Input Word w_t-2 Input Word w_t-1 Input Word w_t+1 Input Word w_t+2 Expected Output w_t
    quick brown The
  The brown fox quick
The quick fox jumps brown

Using the network described above, we want to find the output vector of word probabilities for a given input word. This proceeds as a supervised learning task. Given the one-hot encoding of words in the corpus, each row in the weights matrix (from input to hidden layer) of our neural network is trained to represent the semantic meaning of individual words. That is, semantically similar words will have vector representations - they were originally surrounded by similar words.

When would you choose one approach over the other? The skip-gram approach can have superior performance over CBOW for a small corpus or with rare terms. This is because skip-gram generates more examples for a given word due to the network structure. On the other hand CBOW is faster to train and can produce higher accuracies for more frequent words.

All the best,

Tom

Distributed File Systems and MapReduce

This blog post discusses a solution to the problem in big data. Imagine dealing with a very large dataset, if we want to persist it in any practical way then we have two big problems to deal with due to the sheer size of the dataset. Firstly, the size of the dataset will make it impossible to store on a single machine or disk (in a general case). Secondly, the processing time will become painfully large without introducing some means to parallelise this process. Using a distributed file system (DFS) together with a MapReduce framework will address both these issues. This post will provide a high-level overview of these two technologies. In relation to DFS, we will consider the example of Google File System (GFS).

Google File System

GFS is a means of managing a distributed file system, that is, data storage across multiple machines. The core GFS architecture is based around a single centralised master node, which contains a lookup table to determine the storage location of the the individual files. The files themselves are stored on a much larger number of “Chunkservers”, so-called because they store files in multiple 64 MB chunks - with replication across the network. An application client talks directly to the master node.

MapReduce

MapReduce is a general framework for parallel programming. As above, imagine having a dataset so large that we want to avoid operating on it sequentially. To do this we want to be able to operate on multiple subsets on the dataset independently, but still be able to aggregate the separate subsets later on - afterall it’s the one dataset we’ve just split it up for our own convenience. This is precisely what “map” and “reduce” separately refer to,

  • Map: run some function over each and every element in each subset
  • Reduce: aggregate the subsets

To do this, we assume that our data is separated into key, value pairs. The map function will take in a set of key, value pairs, and return another set of key, value pairs (usually with the key being different afterwards). The reduce function can then group together elements in different subsets by matching on each unique key. Both the map and reduce functions are written by the developer, but the actual execution is left up to the framework.

The key strengths of both technologies are their generality as well as their ability to abstract away low-level details for the developer.

All the best,

Tom