- tags
- Machine Learning
Notes
Introduction
NOTER_PAGE: (4 . 0.084175)
NOTER_PAGE: (4 0.19250936329588017 . 0.3538135593220339)
The usage of embeddings to generate compressed, context-specific representations of content exploded in popularity after the publication of Google’s Word2Vec paper
NOTER_PAGE: (4 0.6794007490636704 . 0.22033898305084745)
NOTER_PAGE: (5 0.4719101123595506 . 0.2997881355932203)
NOTER_PAGE: (5 0.5468164794007491 . 0.4163135593220339)
NOTER_PAGE: (5 0.6134831460674157 . 0.25)
NOTER_PAGE: (5 0.6816479400749064 . 0.24152542372881355)
NOTER_PAGE: (5 0.748314606741573 . 0.4724576271186441)
Creates an embedding space that is specific to the data
NOTER_PAGE: (5 0.8029962546816479 . 0.2510593220338983)
can also generalize to other tasks and domains through transfer learning
NOTER_PAGE: (5 0.8389513108614233 . 0.2521186440677966)
Generally, we represent individual embeddings as row vectors.
NOTER_PAGE: (6 0.14681647940074907 . 0.583686440677966)
tensor (of which a matrix is the two-dimensional case), which is a multidimensional combination of vector representations of multiple elements.
NOTER_PAGE: (6 0.2202247191011236 . 0.3273305084745763)
NOTER_PAGE: (6 0.32359550561797756 . 0.22033898305084745)
NOTER_PAGE: (6 0.3595505617977528 . 0.1906779661016949)
talk about item embeddings being in X dimensions, ranging anywhere from 100 to 1000, with diminishing returns in usefulness somewhere beyond 200-300
NOTER_PAGE: (6 0.550561797752809 . 0.3029661016949153)
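A quick numpy sketch (mine, not from the book) of the shapes involved: one embedding as a row vector, many embeddings stacked into a matrix; the sizes are made up for illustration.
#+begin_src python
import numpy as np

# One item's embedding as a 300-dimensional row vector (a common size).
item_embedding = np.random.rand(1, 300)

# Embeddings for 10,000 items stacked into a matrix: one row per item.
all_item_embeddings = np.random.rand(10_000, 300)

print(item_embedding.shape)       # (1, 300)
print(all_item_embeddings.shape)  # (10000, 300)
#+end_src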
One embedding layer is computed for each layer of the neural network. Each level represents a different view of our given token
NOTER_PAGE: (8 0.10786516853932585 . 0.5508474576271186)
We can get the final embedding by pooling several layers,
NOTER_PAGE: (8 0.16254681647940075 . 0.288135593220339)
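Sketch of layer pooling, assuming the Hugging Face transformers and torch packages (my example, not the book's code): average the last four hidden layers of BERT, then mean-pool over tokens for a single sentence vector.
#+begin_src python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)

inputs = tokenizer("flits about machine learning", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

hidden_states = outputs.hidden_states              # embedding layer + 12 encoder layers
last_four = torch.stack(hidden_states[-4:])        # (4, batch, tokens, 768)
token_embeddings = last_four.mean(dim=0)           # pool across layers
sentence_embedding = token_embeddings.mean(dim=1)  # pool across tokens -> (batch, 768)
#+end_src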
we are often interested in comparing two given items to see how similar they are.
NOTER_PAGE: (8 0.26741573033707866 . 0.4830508474576271)
Engineering systems based on embeddings can be computationally expensive to build and maintain
NOTER_PAGE: (8 0.700374531835206 . 0.21927966101694915)
Recommendation as a business problem
NOTER_PAGE: (9 . 0.662869)
How do we solve the problem of what to show in the timeline here so that our users find the content relevant and interesting, and balance the needs of our advertisers and business partners?
NOTER_PAGE: (10 0.5835205992509364 . 0.22139830508474576)
is this really what you want to work on?
We need to serve users content that is relevant, interesting, and novel so they continue to use the platform. If we do not build discovery and personalization into our content-centric product, Flutter users will not be able to discover more content to consume and will disengage from the platform.
NOTER_PAGE: (10 0.7423220973782771 . 0.1885593220338983)
god speed
The user needs a recommender system most when they are not sure what they want to watch.
NOTER_PAGE: (11 0.35805243445692886 . 0.23940677966101695)
Building a web app
NOTER_PAGE: (11 . 0.58836)
Rules-based systems versus machine learning
NOTER_PAGE: (13 . 0.260702)
NOTER_PAGE: (13 0.6674157303370787 . 0.3718220338983051)
Building a web app with machine learning
NOTER_PAGE: (15 . 0.397656)
Feature Engineering and Selection - The process of examining the data and cleaning it to pick features.
NOTER_PAGE: (15 0.6112359550561798 . 0.2510593220338983)
This piece always takes the longest
NOTER_PAGE: (15 0.6951310861423221 . 0.3305084745762712)
We select the features that are important and train our model,
NOTER_PAGE: (15 0.7617977528089888 . 0.3061440677966102)
Embeddings are also the output of this step
NOTER_PAGE: (15 0.7932584269662921 . 0.475635593220339)
supervised, where we have training data that can tell us whether the results the model predicted are correct
NOTER_PAGE: (16 0.5166240409207161 . 0.26582278481012656)
unsupervised, where there is not a single ground-truth answer.
NOTER_PAGE: (16 0.5492327365728901 . 0.30560578661844484)
NOTER_PAGE: (17 . 0.37471)
A machine learning model is a set of instructions for generating a given output from data.
NOTER_PAGE: (17 0.4993606138107417 . 0.2197106690777577)
we have a UID (userid) and some attributes of that user, such as the number of times they’ve posted and number of posts they’ve liked. These are our machine learning features.
NOTER_PAGE: (17 0.7557544757033249 . 0.3236889692585895)
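A tiny hypothetical version of that table, assuming pandas; the column names are invented for illustration, not taken from the book.
#+begin_src python
import pandas as pd

users = pd.DataFrame(
    {
        "uid": [101, 102, 103],
        "num_posts": [12, 0, 87],     # times they've posted
        "num_likes": [340, 5, 1022],  # posts they've liked
    }
)
features = users[["num_posts", "num_likes"]]  # the ML features we'd feed to a model
#+end_src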
We take two parts of this data as holdout data that we don’t feed into the model. The first part, the test set, we use to validate the final model on data it’s never seen before. We use the second split, called the validation set, to check our hyperparameters during the model training phase.
NOTER_PAGE: (18 0.2749360613810742 . 0.2179023508137432)
usual accepted split is to use 80% of data for training and 20% for testing.
NOTER_PAGE: (18 0.39514066496163686 . 0.5488245931283906)
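Sketch of the split with scikit-learn (toy data, my own example): 20% held out for test, then a validation set carved out of what remains.
#+begin_src python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(1000, 5)             # toy feature matrix
y = np.random.randint(0, 2, size=1000)  # toy labels

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=42)
# 0.25 of the remaining 80% gives roughly a 60/20/20 train/val/test split.
#+end_src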
How do we know our model is good? We initialize it with some set of values, weights, and we iterate on those weights, usually by minimizing a cost function. The cost function is a function that models the difference between our model’s predicted value and the actual output for the training data.
NOTER_PAGE: (19 0.6138107416879796 . 0.2260397830018083)
The average squared difference between an observation’s actual and predicted values is the cost, otherwise known as MSE - mean squared error.
NOTER_PAGE: (19 0.7320971867007673 . 0.6428571428571428)
We’d like to minimize this cost, and we do so with gradient descent.
NOTER_PAGE: (19 0.8369565217391305 . 0.22694394213381555)
our loss should incrementally decrease in every training iteration.
NOTER_PAGE: (20 0.09782608695652174 . 0.6428571428571428)
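Minimal gradient descent on MSE for a one-feature linear model (my own numpy sketch), just to see the loss shrink over the iterations.
#+begin_src python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=100)
y = 3.0 * X + 2.0 + rng.normal(0, 0.1, size=100)  # ground truth: w=3, b=2

w, b, lr = 0.0, 0.0, 0.1
for step in range(500):
    y_pred = w * X + b
    error = y_pred - y
    mse = (error ** 2).mean()          # the cost we want to minimize
    w -= lr * (2 * error * X).mean()   # gradient of MSE with respect to w
    b -= lr * (2 * error).mean()       # gradient of MSE with respect to b

print(round(w, 2), round(b, 2), round(mse, 4))
#+end_src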
The Task of Recommendations
NOTER_PAGE: (20 . 0.221562)
The goal of information retrieval is to synthesize large collections of unstructured text documents.
NOTER_PAGE: (20 0.34207161125319696 . 0.5524412296564195)
Within information retrieval, there are two complementary solutions in how we can offer users the correct content in our app: search, and recommendations.
NOTER_PAGE: (20 0.3548593350383632 . 0.7106690777576853)
The goal of recommender systems is to surface items that are relevant to the user.
NOTER_PAGE: (20 0.6649616368286445 . 0.2197106690777577)
“relevant”?
Collaborative filtering - The most common approach for creating recommendations is to formulate our data as a problem of finding missing user-item interactions in a given set of user-item interaction history.
NOTER_PAGE: (20 0.7691815856777494 . 0.24954792043399637)
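A toy sketch of the idea (not the book's code), assuming scikit-learn: factor the user-item interaction matrix and use the reconstruction to score items a user hasn't interacted with yet.
#+begin_src python
import numpy as np
from sklearn.decomposition import TruncatedSVD

# Rows are users, columns are items; 1 = observed interaction, 0 = unknown.
interactions = np.array([
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [0, 1, 1, 0],
])

svd = TruncatedSVD(n_components=2, random_state=0)
user_factors = svd.fit_transform(interactions)  # (users, 2)
item_factors = svd.components_                  # (2, items)
scores = user_factors @ item_factors            # predicted affinities, including the missing cells
print(scores.round(2))
#+end_src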
Content filtering - This approach uses metadata available about our items (for example in movies or music, the title, year released, genre, and so on) as initial or additional features input into models, and works well when we don’t have much information about user activity.
NOTER_PAGE: (21 0.25191815856777494 . 0.2513562386980108)
Many embeddings architectures fall into this category
NOTER_PAGE: (21 0.3369565217391305 . 0.2965641952983725)
NOTER_PAGE: (21 0.48593350383631717 . 0.24954792043399637)
deep learning architectures used for recommendation include Word2Vec and BERT,
NOTER_PAGE: (21 0.5575447570332481 . 0.6835443037974683)
NOTER_PAGE: (22 0.4462915601023018 . 0.22423146473779385)
Machine learning features
NOTER_PAGE: (22 . 0.540042)
NOTER_PAGE: (23 0.4993606138107417 . 0.22242314647377937)
When we have a single continuous, numerical feature, like “the age of the flit in days”, it’s easy to feed these features into a model. But, when we have textual data, we need to turn it into numerical representations so that we can compare these representations.
NOTER_PAGE: (23 0.5121483375959079 . 0.3426763110307414)
Numerical Feature Vectors
NOTER_PAGE: (23 . 0.590726)
Within the context of working with text in machine learning, we represent features as numerical vectors.
NOTER_PAGE: (23 0.6253196930946292 . 0.20072332730560577)
NOTER_PAGE: (23 0.6937340153452686 . 0.7115732368896925)
From Words to Vectors in Three Easy Pieces
NOTER_PAGE: (24 . 0.400817)
NOTER_PAGE: (24 0.5051150895140666 . 0.3887884267631103)
Encoding - We need to represent our non-numerical, multimodal data as numbers
NOTER_PAGE: (24 0.5959079283887468 . 0.2504520795660036)
NOTER_PAGE: (24 0.6425831202046036 . 0.2513562386980108)
we use lookup tables, also known as hash tables, also known as attention, to help us map between the words and the numbers.
NOTER_PAGE: (24 0.7787723785166241 . 0.45027124773960214)
Historical Encoding Approaches
NOTER_PAGE: (25 . 0.580465)
Early Approaches
NOTER_PAGE: (26 . 0.419119)
Encoding
NOTER_PAGE: (26 . 0.801609)
Ordinal encoding
NOTER_PAGE: (26 0.8356777493606139 . 0.19077757685352623)
We can use this method only if the variables have a natural ordered relationship to each other.
NOTER_PAGE: (26 0.8702046035805627 . 0.2576853526220615)
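Quick ordinal encoding example with scikit-learn; the "size" values are invented and do have a natural order.
#+begin_src python
from sklearn.preprocessing import OrdinalEncoder

sizes = [["small"], ["medium"], ["large"], ["medium"]]
encoder = OrdinalEncoder(categories=[["small", "medium", "large"]])
print(encoder.fit_transform(sizes).ravel())  # [0. 1. 2. 1.]
#+end_src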
Indicator and one-hot encoding
NOTER_PAGE: (27 . 0.57756)
given n categories (e.g., "US", "UK", and "NZ"), encodes the variables into n − 1 categories, creating a new feature for each category.
NOTER_PAGE: (27 0.6189258312020461 . 0.3589511754068716)
NOTER_PAGE: (27 0.8350383631713556 . 0.740506329113924)
One-hot encoding is the most commonly-used of the count-based methods.
NOTER_PAGE: (28 0.1112531969309463 . 0.22242314647377937)
new variable for each feature that we have.
NOTER_PAGE: (28 0.12979539641943735 . 0.3670886075949367)
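Both variants in one pandas sketch (my example): full one-hot gives one column per category, indicator/dummy encoding drops one column to get n − 1.
#+begin_src python
import pandas as pd

countries = pd.Series(["US", "UK", "NZ", "US"])

print(pd.get_dummies(countries))                   # one indicator column per category
print(pd.get_dummies(countries, drop_first=True))  # n - 1 columns, one reference category dropped
#+end_src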
What we’ve built is a standard logistic regression model.
NOTER_PAGE: (28 0.8024296675191817 . 0.22423146473779385)
NOTER_PAGE: (28 0.8228900255754477 . 0.4240506329113924)
neural networks build on simple linear and logistic regression models to generate their output,
NOTER_PAGE: (29 0.09462915601023018 . 0.23056057866184448)
insanely large, sparse vector that marks the occurrence of each word in our vocabulary and is mostly zeros.
NOTER_PAGE: (30 0.6336317135549873 . 0.44394213381555153)
bag of words, or simply the frequency of appearance of text in a given document
NOTER_PAGE: (30 0.6662404092071612 . 0.6175406871609403)
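Bag-of-words sketch with scikit-learn's CountVectorizer (toy documents, mine):
#+begin_src python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the bird flies over the water",
    "the bird sings",
]
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)   # sparse document-term matrix
print(vectorizer.get_feature_names_out())
print(counts.toarray())                   # raw word frequencies per document
#+end_src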
TF-IDF
NOTER_PAGE: (31 . 0.687254)
when we have large amounts of data, we’d like to consider the weights of each term in relation to all the other terms in a collection of documents.
NOTER_PAGE: (31 0.7442455242966752 . 0.39150090415913197)
in the first implementation we separate the corpus words ourselves, don’t remove any stop words, and don’t lowercase everything. Many of these steps are done automatically in scikit-learn or can be set as parameters into the processing pipeline. We’ll see later that these are critical NLP steps that we perform each time we work with text.
NOTER_PAGE: (34 0.7231457800511509 . 0.26853526220614826)
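The scikit-learn version of TF-IDF for comparison (my toy corpus); tokenizing and lowercasing happen automatically, and stop-word removal is a parameter.
#+begin_src python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the bird flies over the water",
    "the bird sings",
    "water water everywhere",
]
vectorizer = TfidfVectorizer()            # stop_words="english" would also drop stop words
tfidf = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())
print(tfidf.toarray().round(2))           # term weights relative to the whole collection
#+end_src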
Generally, when we work with textual representations, we’re trying to understand which words, phrases, or concepts are similar to each other.
NOTER_PAGE: (35 0.23785166240409208 . 0.22332730560578662)
The most commonly used approach in most models where we’re trying to ascertain the semantic closeness of two items is cosine similarity, which is the cosine of the angle between two objects represented as vectors,
NOTER_PAGE: (35 0.631074168797954 . 0.22242314647377937)
in very large, sparse spaces, the direction of the vectors is just as important as, and often more important than, the actual values.
NOTER_PAGE: (35 0.8964194373401535 . 0.5415913200723327)
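Cosine similarity in a few lines of numpy (my sketch), showing that direction, not magnitude, drives the score.
#+begin_src python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 4.0, 0.0])   # same direction, twice the magnitude
c = np.array([0.0, 0.0, 3.0])   # orthogonal

print(cosine_similarity(a, b))  # 1.0
print(cosine_similarity(a, c))  # 0.0
#+end_src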
SVD and PCA
NOTER_PAGE: (37 . 0.506177)
There is a problem with the vectors we created in one-hot encoding and TF-IDF: they are sparse.
NOTER_PAGE: (37 0.5441176470588236 . 0.19439421338155516)
Dense vectors are just vectors that have mostly non-zero values. We call these dense representations dynamic representations
NOTER_PAGE: (37 0.8599744245524298 . 0.3716094032549729)
SVD and PCA are both dimensionality reduction techniques that, applied through matrix transformations to our original text input data, show us the latent relationship between two items
NOTER_PAGE: (38 0.14641943734015347 . 0.22242314647377937)
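A small LSA-flavored sketch, assuming scikit-learn: TF-IDF gives sparse vectors, TruncatedSVD compresses them into a few dense latent dimensions.
#+begin_src python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the bird flies over the water",
    "the bird sings at dawn",
    "stock prices rose sharply",
    "the market closed higher",
]
tfidf = TfidfVectorizer().fit_transform(docs)       # sparse, high-dimensional
svd = TruncatedSVD(n_components=2, random_state=0)
dense_docs = svd.fit_transform(tfidf)               # dense, low-dimensional
print(dense_docs.round(2))                          # each document compressed to 2 latent dimensions
#+end_src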
LDA and LSA
NOTER_PAGE: (38 . 0.627552)
Other approaches grew out of TF-IDF and PCA to address their limitations, including latent semantic analysis (LSA) and latent Dirichlet allocation (LDA)
NOTER_PAGE: (38 0.7500000000000001 . 0.2206148282097649)
words that occur close together more frequently have more important relationships.
NOTER_PAGE: (39 0.09590792838874682 . 0.3670886075949367)
NOTER_PAGE: (39 0.20971867007672637 . 0.22513562386980107)
NOTER_PAGE: (39 0.2800511508951407 . 0.2640144665461121)
Limitations of traditional approaches
NOTER_PAGE: (39 . 0.458567)
as our corpus starts to grow, we start to run into two problems: the curse of dimensionality and compute scale.
NOTER_PAGE: (39 0.5652173913043479 . 0.45027124773960214)
The curse of dimensionality
NOTER_PAGE: (39 . 0.606221)
NOTER_PAGE: (39 0.8791560102301791 . 0.589511754068716)
the curse of dimensionality, which means that the more features we accumulate, the more data we need in order to say anything about them with statistical confidence,
NOTER_PAGE: (40 0.3906649616368287 . 0.2974683544303797)
Computational complexity
NOTER_PAGE: (40 . 0.486646)
time complexity of computing the TF-IDF weights for all the terms in all the documents is O(Nd), where N is the total number of terms in the corpus and d is the number of documents in the corpus. Additionally, because TF-IDF creates a matrix as output, what we end up doing is processing enormous, sparse matrices.
NOTER_PAGE: (41 0.11189258312020461 . 0.5641952983725136)
Capturing log data at scale began the rise of the Big Data era,
NOTER_PAGE: (41 0.44757033248081846 . 0.22242314647377937)
Support Vector Machines
NOTER_PAGE: (41 . 0.66135)
The goal of the SVM is to find the optimal hyperplane such that, when objects (words in our case) are projected into the space, the distance between the plane and the elements is maximized, so there’s less chance of mis-classifying them.
NOTER_PAGE: (41 0.8305626598465474 . 0.22242314647377937)
Because in our sparse vector representations of elements most of the distances are zero, the hyperplane will fail to cleanly separate the boundaries and will classify words incorrectly.
NOTER_PAGE: (42 0.5895140664961638 . 0.6365280289330922)
Word2Vec
NOTER_PAGE: (42 . 0.651915)
focusing not only on the inherent labels of individual words, but on the relationship between those representations.
NOTER_PAGE: (43 0.090153452685422 . 0.4547920433996383)
NOTER_PAGE: (43 0.2282608695652174 . 0.2197106690777577)
In training CBOW, we do the opposite: we remove a word from the middle of a phrase known as the context window and train a model to predict the probability that a given word fills the blank,
NOTER_PAGE: (43 0.7685421994884911 . 0.21699819168173598)
tokenizes — or creates smaller, word-level representations of each sentence
NOTER_PAGE: (45 0.4980818414322251 . 0.43399638336347196)
NOTER_PAGE: (45 0.5581841432225064 . 0.22513562386980107)
next step is to create one-hot encodings of each word to a numerical position, and each position back to a word,
NOTER_PAGE: (46 0.4079283887468031 . 0.40415913200723325)
The embedding layer is a lookup table that matches a word to the corresponding word vector on an index by index basis.
NOTER_PAGE: (46 0.540920716112532 . 0.3010849909584087)
NOTER_PAGE: (46 0.6771099744245525 . 0.22242314647377937)
NOTER_PAGE: (46 0.7141943734015346 . 0.64376130198915)
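Sketch of the lookup-table idea with PyTorch (mine, not the book's implementation): map words to integer positions, then let an embedding layer return a dense row vector per index.
#+begin_src python
import torch
import torch.nn as nn

sentence = "the bird flies over the water".split()
vocab = sorted(set(sentence))
word_to_idx = {word: i for i, word in enumerate(vocab)}
idx_to_word = {i: word for word, i in word_to_idx.items()}

embedding_layer = nn.Embedding(num_embeddings=len(vocab), embedding_dim=8)  # 8 dims, tiny for illustration

indices = torch.tensor([word_to_idx[w] for w in sentence])
vectors = embedding_layer(indices)  # one row vector per word, looked up by index
print(vectors.shape)                # torch.Size([6, 8])
#+end_src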
For CBOW, we take a single word and we pick a sliding window, in our case, two words before, and two words after, and try to infer what the actual word is. This is called the context vector, and in other cases, we’ll see that it’s called attention.
NOTER_PAGE: (47 0.5831202046035806 . 0.725135623869801)
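Building the CBOW context windows by hand (a toy sketch): two words before and two after each target word.
#+begin_src python
tokens = "the quick brown fox jumps over the lazy dog".split()
window = 2

pairs = []
for i, target in enumerate(tokens):
    context = [tokens[j]
               for j in range(max(0, i - window), min(len(tokens), i + window + 1))
               if j != i]
    pairs.append((context, target))

print(pairs[2])  # (['the', 'quick', 'fox', 'jumps'], 'brown')
#+end_src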
Modern Embeddings Approaches
NOTER_PAGE: (50 . 0.155913)
Backpropagation is how a model learns to converge by calculating the gradient of the loss function with respect to the weights of the neural network, using the chain rule, a concept from calculus which allows us to calculate the derivative of a function made up of multiple functions. This mechanism allows the model to understand when it’s reached a global minimum for loss
NOTER_PAGE: (50 0.28388746803069054 . 0.23688969258589512)
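A minimal look at the chain rule in action, assuming PyTorch autograd (my sketch): the gradient of a squared-error loss with respect to a weight, then one descent step.
#+begin_src python
import torch

w = torch.tensor(0.5, requires_grad=True)
x = torch.tensor(2.0)
y_true = torch.tensor(3.0)

loss = (w * x - y_true) ** 2        # squared error
loss.backward()                     # backprop: chain rule gives dloss/dw = 2 * (w*x - y) * x

print(w.grad)                       # tensor(-8.)
w_next = w.detach() - 0.1 * w.grad  # one gradient descent step on the weight
#+end_src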
Neural Networks
NOTER_PAGE: (51 . 0.221989)
when we start dealing with extremely large, implicit feature spaces, such as are present in text, audio, or video, we will not be able to derive specific features that wouldn’t be obvious if we were manually creating them.
NOTER_PAGE: (51 0.3938618925831202 . 0.3047016274864376)
Neural Network architectures
NOTER_PAGE: (51 . 0.538626)
RNNs and CNNs are used mainly in feature extraction
NOTER_PAGE: (51 0.8510230179028133 . 0.2206148282097649)
Neural networks are complex to build and manage for a number of reasons. First, they require extremely large corpuses of clean, well-labeled data
NOTER_PAGE: (52 0.5044757033248082 . 0.22423146473779385)
These features made developing and running neural networks prohibitively expensive until the last fifteen years or so.
NOTER_PAGE: (52 0.6751918158567776 . 0.22694394213381555)
NOTER_PAGE: (53 . 0.373738)
Word2Vec doesn’t work well on long ranges of text that require understanding words in context of each other.
NOTER_PAGE: (53 0.5294117647058824 . 0.3236889692585895)
Word2Vec can’t handle out-of-vocabulary words — words that the model has not been trained on and needs to generalize to.
NOTER_PAGE: (53 0.631074168797954 . 0.3625678119349005)
Word2Vec encounters context collapse around polysemy — the coexistence of many possible meanings for the same phrase:
NOTER_PAGE: (53 0.7295396419437341 . 0.42766726943942135)
A popular variation of an RNN that worked around this problem was the long short-term memory network (LSTM),
NOTER_PAGE: (54 0.5242966751918159 . 0.2857142857142857)
While LSTMs worked fairly well, they had their own limitations. Because they were architecturally complicated, they took much longer to train, and at a higher computational cost, because they couldn’t be trained in parallel.
NOTER_PAGE: (54 0.6227621483375959 . 0.24321880650994573)
Encoders/Decoders and Attention
NOTER_PAGE: (54 . 0.701177)
Two concepts allowed researchers to overcome computationally expensive issues with remembering long vectors for a larger context window than what was available in RNNs and Word2Vec before it: the encoder/decoder architecture, and the attention mechanism.
NOTER_PAGE: (54 0.7372122762148339 . 0.19710669077757684)
The encoder/decoder architecture is a neural network architecture comprised of two neural networks, an encoder that takes the input vectors from our data and creates an embedding of a fixed length, and a decoder, also a neural network, which takes the embeddings encoded as input and generates a static set of outputs such as translated text or a text summary.
NOTER_PAGE: (54 0.8043478260869565 . 0.2206148282097649)
NOTER_PAGE: (55 0.09590792838874682 . 0.3282097649186257)
We can think of attention as a very large, complex hash table that keeps track of the words in the text and how they map to different representations both in the input and the output.
NOTER_PAGE: (55 0.16240409207161127 . 0.352622061482821)
NOTER_PAGE: (56 0.6579283887468031 . 0.2151898734177215)
The goal of a transformer model is to take a piece of multimodal content, and learn the latent relationships by creating multiple views of groups of words in the input corpus (multiple context windows).
NOTER_PAGE: (57 0.31521739130434784 . 0.19168173598553345)
the next-to-last layer is the model’s embeddings, which we can use for downstream work.
NOTER_PAGE: (57 0.432225063938619 . 0.22513562386980107)
these alone will not help us with context, so, on top of this, we also learn positional embeddings
NOTER_PAGE: (57 0.7947570332480819 . 0.19529837251356238)
NOTER_PAGE: (58 0.17647058823529413 . 0.2703435804701627)
For each embedding, we generate a weighted average value based on these learned attention weights.
NOTER_PAGE: (58 0.3478260869565218 . 0.19529837251356238)
What’s great about scaled dot-product attention (and about all of the layers of the encoder) is that the work can be done in parallel
NOTER_PAGE: (58 0.4974424552429668 . 0.2206148282097649)
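Scaled dot-product attention in plain numpy (my sketch of the mechanism described here): similarity scores between queries and keys, softmaxed into weights, then a weighted average of the values.
#+begin_src python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # how well each query matches each key
    weights = softmax(scores)        # attention weights sum to 1 per query
    return weights @ V               # each output is a weighted average of the values

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))  # 4 tokens, dimension 8
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 8)
#+end_src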
BERT
NOTER_PAGE: (59 . 0.503015)
The output of BERT is latent representations of words and their context — a set of embeddings. BERT is, essentially, an enormous parallelized Word2Vec that remembers longer context windows.
NOTER_PAGE: (60 0.24616368286445015 . 0.2206148282097649)
GPT
NOTER_PAGE: (60 . 0.357176)
GPT differs from BERT in that it encodes as well as decodes text from embeddings and therefore can be used for probabilistic inference.
NOTER_PAGE: (60 0.40984654731457804 . 0.5777576853526221)
Embeddings in Production
NOTER_PAGE: (60 . 0.692233)
The model that is deployed is always better and more accurate than the model that is only ever a prototype.
NOTER_PAGE: (61 0.19948849104859337 . 0.2206148282097649)
hard disagree
NOTER_PAGE: (61 0.4008951406649617 . 0.30560578661844484)
NOTER_PAGE: (61 0.5338874680306905 . 0.22332730560578662)
Embeddings in Practice
NOTER_PAGE: (62 . 0.084175)
Pinterest
NOTER_PAGE: (62 . 0.264902)
YouTube and Google Play Store
NOTER_PAGE: (63 . 0.333707)
NOTER_PAGE: (66 . 0.816559)
Embeddings as an Engineering Problem
NOTER_PAGE: (69 . 0.438294)
NOTER_PAGE: (69 0.4769820971867008 . 0.28752260397830015)
they blend data that then needs to be monitored for drift
NOTER_PAGE: (69 0.5121483375959079 . 0.2857142857142857)
non-deterministic in their outputs,
NOTER_PAGE: (69 0.5287723785166241 . 0.39511754068716093)
processing pipeline jungles.
NOTER_PAGE: (69 0.5639386189258313 . 0.5325497287522604)
Embeddings Generation
NOTER_PAGE: (71 . 0.380917)
NOTER_PAGE: (72 0.159846547314578 . 0.2197106690777577)
There are pre-trained BERT embeddings available that we can fine-tune. There are other generalized corpuses available, such as GloVe, Word2Vec, and FastText
NOTER_PAGE: (72 0.3657289002557545 . 0.2396021699819168)
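Pulling pre-trained vectors instead of training from scratch, assuming the gensim package (the corpus name is one of gensim's standard downloads, not from the book):
#+begin_src python
import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-100")  # downloads the vectors on first use
print(glove["bird"][:5])                     # first few dimensions of one word vector
print(glove.most_similar("bird", topn=3))    # nearest neighbors in the embedding space
#+end_src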
Storage and Retrieval
NOTER_PAGE: (72 . 0.512613)
Drift Detection, Versioning, and Interpretability
NOTER_PAGE: (74 . 0.171293)
Inference and Latency
NOTER_PAGE: (75 . 0.306481)
Online and Offline Model Evaluation
NOTER_PAGE: (76 . 0.084175)
What makes embeddings projects successful
NOTER_PAGE: (76 . 0.349394)
Conclusion
NOTER_PAGE: (76 . 0.819141)
Notes that link to this note