- tags
- Machine Learning
Notes
Introduction
NOTER_PAGE: (4 . 0.084175)
NOTER_PAGE: (4 0.19250936329588017 . 0.3538135593220339)
The usage of embeddings to generate compressed, context-specific representations of content exploded in popularity after the publication of Google’s Word2Vec paper
NOTER_PAGE: (4 0.6794007490636704 . 0.22033898305084745)
NOTER_PAGE: (5 0.4719101123595506 . 0.2997881355932203)
NOTER_PAGE: (5 0.5468164794007491 . 0.4163135593220339)
NOTER_PAGE: (5 0.6134831460674157 . 0.25)
NOTER_PAGE: (5 0.6816479400749064 . 0.24152542372881355)
NOTER_PAGE: (5 0.748314606741573 . 0.4724576271186441)
Creates an embedding space that is specific to the data
NOTER_PAGE: (5 0.8029962546816479 . 0.2510593220338983)
can also generalize to other tasks and domains through transfer learning
NOTER_PAGE: (5 0.8389513108614233 . 0.2521186440677966)
Generally, we represent individual embeddings as row vectors.
NOTER_PAGE: (6 0.14681647940074907 . 0.583686440677966)
tensor (of which a matrix is the two-dimensional case), which is a multidimensional combination of vector representations of multiple elements.
NOTER_PAGE: (6 0.2202247191011236 . 0.3273305084745763)
NOTER_PAGE: (6 0.32359550561797756 . 0.22033898305084745)
NOTER_PAGE: (6 0.3595505617977528 . 0.1906779661016949)
talk about item embeddings being in X dimensions, ranging anywhere from 100 to 1000, with diminishing returns in usefulness somewhere beyond 200-300
NOTER_PAGE: (6 0.550561797752809 . 0.3029661016949153)
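A quick numpy sketch (mine, not from the book) of the shapes involved: one embedding as a row vector, many embeddings stacked into a matrix; the sizes are made up for illustration.
#+begin_src python
import numpy as np

# One item's embedding as a 300-dimensional row vector (a common size).
item_embedding = np.random.rand(1, 300)

# Embeddings for 10,000 items stacked into a matrix: one row per item.
all_item_embeddings = np.random.rand(10_000, 300)

print(item_embedding.shape)       # (1, 300)
print(all_item_embeddings.shape)  # (10000, 300)
#+end_src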
One embedding layer is computed for each layer of the neural network. Each level represents a different view of our given token
NOTER_PAGE: (8 0.10786516853932585 . 0.5508474576271186)
We can get the final embedding by pooling several layers,
NOTER_PAGE: (8 0.16254681647940075 . 0.288135593220339)
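Sketch of layer pooling, assuming the Hugging Face transformers and torch packages (my example, not the book's code): average the last four hidden layers of BERT, then mean-pool over tokens for a single sentence vector.
#+begin_src python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)

inputs = tokenizer("flits about machine learning", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

hidden_states = outputs.hidden_states              # embedding layer + 12 encoder layers
last_four = torch.stack(hidden_states[-4:])        # (4, batch, tokens, 768)
token_embeddings = last_four.mean(dim=0)           # pool across layers
sentence_embedding = token_embeddings.mean(dim=1)  # pool across tokens -> (batch, 768)
#+end_src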
we are often interested in comparing two given items to see how similar they are.
NOTER_PAGE: (8 0.26741573033707866 . 0.4830508474576271)
Engineering systems based on embeddings can be computationally expensive to build and maintain
NOTER_PAGE: (8 0.700374531835206 . 0.21927966101694915)
Recommendation as a business problem
NOTER_PAGE: (9 . 0.662869)
How do we solve the problem of what to show in the timeline here so that our users find the content relevant and interesting, and balance the needs of our advertisers and business partners?
NOTER_PAGE: (10 0.5835205992509364 . 0.22139830508474576)
is this really what you want to work on?
We need to serve users content that is relevant, interesting, and novel so they continue to use the platform. If we do not build discovery and personalization into our content-centric product, Flutter users will not be able to discover more content to consume and will disengage from the platform.
NOTER_PAGE: (10 0.7423220973782771 . 0.1885593220338983)
god speed
The user needs a recommender system most when they are not sure what they want to watch.
NOTER_PAGE: (11 0.35805243445692886 . 0.23940677966101695)
Building a web app
NOTER_PAGE: (11 . 0.58836)
Rules-based systems versus machine learning
NOTER_PAGE: (13 . 0.260702)
NOTER_PAGE: (13 0.6674157303370787 . 0.3718220338983051)
Building a web app with machine learning
NOTER_PAGE: (15 . 0.397656)
Feature Engineering and Selection - The process of examining the data and cleaning it to pick features.
NOTER_PAGE: (15 0.6112359550561798 . 0.2510593220338983)
This piece always takes the longest
NOTER_PAGE: (15 0.6951310861423221 . 0.3305084745762712)
We select the features that are important and train our model,
NOTER_PAGE: (15 0.7617977528089888 . 0.3061440677966102)
Embeddings are also the output of this step
NOTER_PAGE: (15 0.7932584269662921 . 0.475635593220339)
supervised, where we have training data that can tell us whether the results the model predicted are correct
NOTER_PAGE: (16 0.5166240409207161 . 0.26582278481012656)
unsupervised, where there is not a single ground-truth answer.
NOTER_PAGE: (16 0.5492327365728901 . 0.30560578661844484)
NOTER_PAGE: (17 . 0.37471)
A machine learning model is a set of instructions for generating a given output from data.
NOTER_PAGE: (17 0.4993606138107417 . 0.2197106690777577)
we have a UID (userid) and some attributes of that user, such as the number of times they’ve posted and number of posts they’ve liked. These are our machine learning features.
NOTER_PAGE: (17 0.7557544757033249 . 0.3236889692585895)
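A tiny hypothetical version of that table, assuming pandas; the column names are invented for illustration, not taken from the book.
#+begin_src python
import pandas as pd

users = pd.DataFrame(
    {
        "uid": [101, 102, 103],
        "num_posts": [12, 0, 87],     # times they've posted
        "num_likes": [340, 5, 1022],  # posts they've liked
    }
)
features = users[["num_posts", "num_likes"]]  # the ML features we'd feed to a model
#+end_src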
We take two parts of this data as holdout data that we don’t feed into the model. The first part, the test set, we use to validate the final model on data it’s never seen before. We use the second split, called the validation set, to check our hyperparameters during the model training phase.
NOTER_PAGE: (18 0.2749360613810742 . 0.2179023508137432)
usual accepted split is to use 80% of data for training and 20% for testing.
NOTER_PAGE: (18 0.39514066496163686 . 0.5488245931283906)
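Sketch of the split with scikit-learn (toy data, my own example): 20% held out for test, then a validation set carved out of what remains.
#+begin_src python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(1000, 5)             # toy feature matrix
y = np.random.randint(0, 2, size=1000)  # toy labels

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=42)
# 0.25 of the remaining 80% gives roughly a 60/20/20 train/val/test split.
#+end_src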
How do we know our model is good? We initialize it with some set of values, weights, and we iterate on those weights, usually by minimizing a cost function. The cost function is a function that models the difference between our model’s predicted value and the actual output for the training data.
NOTER_PAGE: (19 0.6138107416879796 . 0.2260397830018083)
The average squared difference between an observation’s actual and predicted values is the cost, otherwise known as MSE - mean squared error.
NOTER_PAGE: (19 0.7320971867007673 . 0.6428571428571428)
We’d like to minimize this cost, and we do so with gradient descent.
NOTER_PAGE: (19 0.8369565217391305 . 0.22694394213381555)
our loss should incrementally decrease in every training iteration.
NOTER_PAGE: (20 0.09782608695652174 . 0.6428571428571428)
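Minimal gradient descent on MSE for a one-feature linear model (my own numpy sketch), just to see the loss shrink over the iterations.
#+begin_src python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=100)
y = 3.0 * X + 2.0 + rng.normal(0, 0.1, size=100)  # ground truth: w=3, b=2

w, b, lr = 0.0, 0.0, 0.1
for step in range(500):
    y_pred = w * X + b
    error = y_pred - y
    mse = (error ** 2).mean()          # the cost we want to minimize
    w -= lr * (2 * error * X).mean()   # gradient of MSE with respect to w
    b -= lr * (2 * error).mean()       # gradient of MSE with respect to b

print(round(w, 2), round(b, 2), round(mse, 4))
#+end_src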
The Task of Recommendations
NOTER_PAGE: (20 . 0.221562)
The goal of information retrieval is to synthesize large collections of unstructured text documents.
NOTER_PAGE: (20 0.34207161125319696 . 0.5524412296564195)
Within information retrieval, there are two complementary solutions in how we can offer users the correct content in our app: search, and recommendations.
NOTER_PAGE: (20 0.3548593350383632 . 0.7106690777576853)
The goal of recommender systems is to surface items that are relevant to the user.
NOTER_PAGE: (20 0.6649616368286445 . 0.2197106690777577)
“relevant”?
Collaborative filtering - The most common approach for creating recommendations is to formulate our data as a problem of finding missing user-item interactions in a given set of user-item interaction history.
NOTER_PAGE: (20 0.7691815856777494 . 0.24954792043399637)
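A toy sketch of the idea (not the book's code), assuming scikit-learn: factor the user-item interaction matrix and use the reconstruction to score items a user hasn't interacted with yet.
#+begin_src python
import numpy as np
from sklearn.decomposition import TruncatedSVD

# Rows are users, columns are items; 1 = observed interaction, 0 = unknown.
interactions = np.array([
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [0, 1, 1, 0],
])

svd = TruncatedSVD(n_components=2, random_state=0)
user_factors = svd.fit_transform(interactions)  # (users, 2)
item_factors = svd.components_                  # (2, items)
scores = user_factors @ item_factors            # predicted affinities, including the missing cells
print(scores.round(2))
#+end_src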
Content filtering - This approach uses metadata available about our items (for example in movies or music, the title, year released, genre, and so on) as initial or additional features input into models, and works well when we don’t have much information about user activity.
NOTER_PAGE: (21 0.25191815856777494 . 0.2513562386980108)
Many embeddings architectures fall into this category
NOTER_PAGE: (21 0.3369565217391305 . 0.2965641952983725)
NOTER_PAGE: (21 0.48593350383631717 . 0.24954792043399637)
deep learning architectures used for recommendation include Word2Vec and BERT,
NOTER_PAGE: (21 0.5575447570332481 . 0.6835443037974683)
NOTER_PAGE: (22 0.4462915601023018 . 0.22423146473779385)
Machine learning features
NOTER_PAGE: (22 . 0.540042)
NOTER_PAGE: (23 0.4993606138107417 . 0.22242314647377937)
When we have a single continuous, numerical feature, like “the age of the flit in days”, it’s easy to feed these features into a model. But, when we have textual data, we need to turn it into numerical representations so that we can compare these representations.
NOTER_PAGE: (23 0.5121483375959079 . 0.3426763110307414)
Numerical Feature Vectors
NOTER_PAGE: (23 . 0.590726)
Within the context of working with text in machine learning, we represent features as numerical vectors.
NOTER_PAGE: (23 0.6253196930946292 . 0.20072332730560577)
NOTER_PAGE: (23 0.6937340153452686 . 0.7115732368896925)
From Words to Vectors in Three Easy Pieces
NOTER_PAGE: (24 . 0.400817)
NOTER_PAGE: (24 0.5051150895140666 . 0.3887884267631103)
Encoding - We need to represent our non-numerical, multimodal data as numbers
NOTER_PAGE: (24 0.5959079283887468 . 0.2504520795660036)
NOTER_PAGE: (24 0.6425831202046036 . 0.2513562386980108)
we use lookup tables, also known as hash tables, also known as attention, to help us map between the words and the numbers.
NOTER_PAGE: (24 0.7787723785166241 . 0.45027124773960214)
Historical Encoding Approaches
NOTER_PAGE: (25 . 0.580465)
Early Approaches
NOTER_PAGE: (26 . 0.419119)
Encoding
NOTER_PAGE: (26 . 0.801609)
Ordinal encoding
NOTER_PAGE: (26 0.8356777493606139 . 0.19077757685352623)
We can use this method only if the variables have a natural ordered relationship to each other.
NOTER_PAGE: (26 0.8702046035805627 . 0.2576853526220615)
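Quick ordinal encoding example with scikit-learn; the "size" values are invented and do have a natural order.
#+begin_src python
from sklearn.preprocessing import OrdinalEncoder

sizes = [["small"], ["medium"], ["large"], ["medium"]]
encoder = OrdinalEncoder(categories=[["small", "medium", "large"]])
print(encoder.fit_transform(sizes).ravel())  # [0. 1. 2. 1.]
#+end_src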
Indicator and one-hot encoding
NOTER_PAGE: (27 . 0.57756)
given n categories (e.g., "US", "UK", and "NZ"), encodes the variables into n − 1 categories, creating a new feature for each category.
NOTER_PAGE: (27 0.6189258312020461 . 0.3589511754068716)
NOTER_PAGE: (27 0.8350383631713556 . 0.740506329113924)
One-hot encoding is the most commonly-used of the count-based methods.
NOTER_PAGE: (28 0.1112531969309463 . 0.22242314647377937)
new variable for each feature that we have.
NOTER_PAGE: (28 0.12979539641943735 . 0.3670886075949367)
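Both variants in one pandas sketch (my example): full one-hot gives one column per category, indicator/dummy encoding drops one column to get n − 1.
#+begin_src python
import pandas as pd

countries = pd.Series(["US", "UK", "NZ", "US"])

print(pd.get_dummies(countries))                   # one indicator column per category
print(pd.get_dummies(countries, drop_first=True))  # n - 1 columns, one reference category dropped
#+end_src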
What we’ve built is a standard logistic regression model.
NOTER_PAGE: (28 0.8024296675191817 . 0.22423146473779385)
NOTER_PAGE: (28 0.8228900255754477 . 0.4240506329113924)
neural networks build on simple linear and logistic regression models to generate their output,
NOTER_PAGE: (29 0.09462915601023018 . 0.23056057866184448)
insanely large, sparse vector that marks the occurrence of each word in our vocabulary and is mostly zeros.
NOTER_PAGE: (30 0.6336317135549873 . 0.44394213381555153)
bag of words, or simply the frequency of appearance of text in a given document
NOTER_PAGE: (30 0.6662404092071612 . 0.6175406871609403)
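Bag-of-words sketch with scikit-learn's CountVectorizer (toy documents, mine):
#+begin_src python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the bird flies over the water",
    "the bird sings",
]
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)   # sparse document-term matrix
print(vectorizer.get_feature_names_out())
print(counts.toarray())                   # raw word frequencies per document
#+end_src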
TF-IDF
NOTER_PAGE: (31 . 0.687254)
when we have large amounts of data, we’d like to consider the weights of each term in relation to all the other terms in a collection of documents.
NOTER_PAGE: (31 0.7442455242966752 . 0.39150090415913197)
in the first implementation we separate the corpus words ourselves, don’t remove any stop words, and don’t lowercase everything. Many of these steps are done automatically in scikit-learn or can be set as parameters into the processing pipeline. We’ll see later that these are critical NLP steps that we perform each time we work with text.
NOTER_PAGE: (34 0.7231457800511509 . 0.26853526220614826)
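The scikit-learn version of TF-IDF for comparison (my toy corpus); tokenizing and lowercasing happen automatically, and stop-word removal is a parameter.
#+begin_src python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the bird flies over the water",
    "the bird sings",
    "water water everywhere",
]
vectorizer = TfidfVectorizer()            # stop_words="english" would also drop stop words
tfidf = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())
print(tfidf.toarray().round(2))           # term weights relative to the whole collection
#+end_src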
Generally, when we work with textual representations, we’re trying to understand which words, phrases, or concepts are similar to each other.
NOTER_PAGE: (35 0.23785166240409208 . 0.22332730560578662)
The most commonly used approach in most models where we’re trying to ascertain the semantic closeness of two items is cosine similarity, which is the cosine of the angle between two objects represented as vectors,
NOTER_PAGE: (35 0.631074168797954 . 0.22242314647377937)
in very large, sparse spaces, the direction of the vectors is just as important as, and often more important than, the actual values.
NOTER_PAGE: (35 0.8964194373401535 . 0.5415913200723327)
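Cosine similarity in a few lines of numpy (my sketch), showing that direction, not magnitude, drives the score.
#+begin_src python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 4.0, 0.0])   # same direction, twice the magnitude
c = np.array([0.0, 0.0, 3.0])   # orthogonal

print(cosine_similarity(a, b))  # 1.0
print(cosine_similarity(a, c))  # 0.0
#+end_src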
SVD and PCA
NOTER_PAGE: (37 . 0.506177)
There is a problem with the vectors we created in one-hot encoding and TF-IDF: they are sparse.
NOTER_PAGE: (37 0.5441176470588236 . 0.19439421338155516)
Dense vectors are just vectors that have mostly non-zero values. We call these dense representations dynamic representations
NOTER_PAGE: (37 0.8599744245524298 . 0.3716094032549729)
SVD and PCA are both dimensionality reduction techniques that, applied through matrix transformations to our original text input data, show us the latent relationship between two items
NOTER_PAGE: (38 0.14641943734015347 . 0.22242314647377937)
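A small LSA-flavored sketch, assuming scikit-learn: TF-IDF gives sparse vectors, TruncatedSVD compresses them into a few dense latent dimensions.
#+begin_src python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the bird flies over the water",
    "the bird sings at dawn",
    "stock prices rose sharply",
    "the market closed higher",
]
tfidf = TfidfVectorizer().fit_transform(docs)       # sparse, high-dimensional
svd = TruncatedSVD(n_components=2, random_state=0)
dense_docs = svd.fit_transform(tfidf)               # dense, low-dimensional
print(dense_docs.round(2))                          # each document compressed to 2 latent dimensions
#+end_src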
LDA and LSA
NOTER_PAGE: (38 . 0.627552)
Other approaches grew out of TF-IDF and PCA to address their limitations, including latent semantic analysis (LSA) and latent Dirichlet allocation (LDA)
NOTER_PAGE: (38 0.7500000000000001 . 0.2206148282097649)
words that occur close together more frequently have more important relationships.
NOTER_PAGE: (39 0.09590792838874682 . 0.3670886075949367)
NOTER_PAGE: (39 0.20971867007672637 . 0.22513562386980107)
NOTER_PAGE: (39 0.2800511508951407 . 0.2640144665461121)
Limitations of traditional approaches
NOTER_PAGE: (39 . 0.458567)
as our corpus starts to grow, we start to run into two problems: the curse of dimensionality and compute scale.
NOTER_PAGE: (39 0.5652173913043479 . 0.45027124773960214)
The curse of dimensionality
NOTER_PAGE: (39 . 0.606221)
NOTER_PAGE: (39 0.8791560102301791 . 0.589511754068716)
the curse of dimensionality, which means that the more features we accumulate, the more data we need in order to say anything about them with statistical confidence,
NOTER_PAGE: (40 0.3906649616368287 . 0.2974683544303797)
Computational complexity
NOTER_PAGE: (40 . 0.486646)
time complexity of computing the TF-IDF weights for all the terms in all the documents is O(Nd), where N is the total number of terms in the corpus and d is the number of documents in the corpus. Additionally, because TF-IDF creates a matrix as output, what we end up doing is processing enormous, sparse matrices.
NOTER_PAGE: (41 0.11189258312020461 . 0.5641952983725136)
Capturing log data at scale began the rise of the Big Data era,
NOTER_PAGE: (41 0.44757033248081846 . 0.22242314647377937)
Support Vector Machines
NOTER_PAGE: (41 . 0.66135)
The goal of the SVM is to find the optimal hyperplane such that, when objects (words in our case) are projected into the space, the distance between the plane and the elements is maximized, so there’s less chance of mis-classifying them.
NOTER_PAGE: (41 0.8305626598465474 . 0.22242314647377937)
Because in our sparse vector representations of elements most of the distances are zero, the hyperplane will fail to cleanly separate the boundaries and will classify words incorrectly.
NOTER_PAGE: (42 0.5895140664961638 . 0.6365280289330922)
Word2Vec
NOTER_PAGE: (42 . 0.651915)
focusing not only on the inherent labels of individual words, but on the relationship between those representations.
NOTER_PAGE: (43 0.090153452685422 . 0.4547920433996383)
NOTER_PAGE: (43 0.2282608695652174 . 0.2197106690777577)
In training CBOW, we do the opposite: we remove a word from the middle of a phrase known as the context window and train a model to predict the probability that a given word fills the blank,
NOTER_PAGE: (43 0.7685421994884911 . 0.21699819168173598)
tokenizes — or creates smaller, word-level representations of each sentence
NOTER_PAGE: (45 0.4980818414322251 . 0.43399638336347196)
NOTER_PAGE: (45 0.5581841432225064 . 0.22513562386980107)
next step is to create one-hot encodings of each word to a numerical position, and each position back to a word,
NOTER_PAGE: (46 0.4079283887468031 . 0.40415913200723325)
The embedding layer is a lookup table that matches a word to the corresponding word vector on an index by index basis.
NOTER_PAGE: (46 0.540920716112532 . 0.3010849909584087)
NOTER_PAGE: (46 0.6771099744245525 . 0.22242314647377937)
NOTER_PAGE: (46 0.7141943734015346 . 0.64376130198915)
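Sketch of the lookup-table idea with PyTorch (mine, not the book's implementation): map words to integer positions, then let an embedding layer return a dense row vector per index.
#+begin_src python
import torch
import torch.nn as nn

sentence = "the bird flies over the water".split()
vocab = sorted(set(sentence))
word_to_idx = {word: i for i, word in enumerate(vocab)}
idx_to_word = {i: word for word, i in word_to_idx.items()}

embedding_layer = nn.Embedding(num_embeddings=len(vocab), embedding_dim=8)  # 8 dims, tiny for illustration

indices = torch.tensor([word_to_idx[w] for w in sentence])
vectors = embedding_layer(indices)  # one row vector per word, looked up by index
print(vectors.shape)                # torch.Size([6, 8])
#+end_src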
For CBOW, we take a single word and we pick a sliding window, in our case, two words before, and two words after, and try to infer what the actual word is. This is called the context vector, and in other cases, we’ll see that it’s called attention.
NOTER_PAGE: (47 0.5831202046035806 . 0.725135623869801)
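Building the CBOW context windows by hand (a toy sketch): two words before and two after each target word.
#+begin_src python
tokens = "the quick brown fox jumps over the lazy dog".split()
window = 2

pairs = []
for i, target in enumerate(tokens):
    context = [tokens[j]
               for j in range(max(0, i - window), min(len(tokens), i + window + 1))
               if j != i]
    pairs.append((context, target))

print(pairs[2])  # (['the', 'quick', 'fox', 'jumps'], 'brown')
#+end_src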
Modern Embeddings Approaches
NOTER_PAGE: (50 . 0.155913)
Backpropagation is how a model learns to converge by calculating the gradient of the loss function with respect to the weights of the neural network, using the chain rule, a concept from calculus which allows us to calculate the derivative of a function made up of multiple functions. This mechanism allows the model to understand when it’s reached a global minimum for loss
NOTER_PAGE: (50 0.28388746803069054 . 0.23688969258589512)
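A minimal look at the chain rule in action, assuming PyTorch autograd (my sketch): the gradient of a squared-error loss with respect to a weight, then one descent step.
#+begin_src python
import torch

w = torch.tensor(0.5, requires_grad=True)
x = torch.tensor(2.0)
y_true = torch.tensor(3.0)

loss = (w * x - y_true) ** 2        # squared error
loss.backward()                     # backprop: chain rule gives dloss/dw = 2 * (w*x - y) * x

print(w.grad)                       # tensor(-8.)
w_next = w.detach() - 0.1 * w.grad  # one gradient descent step on the weight
#+end_src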
Neural Networks
NOTER_PAGE: (51 . 0.221989)
when we start dealing with extremely large, implicit feature spaces, such as are present in text, audio, or video, we will not be able to derive specific features that wouldn’t be obvious if we were manually creating them.
NOTER_PAGE: (51 0.3938618925831202 . 0.3047016274864376)
Neural Network architectures
NOTER_PAGE: (51 . 0.538626)
RNNs and CNNs are used mainly in feature extraction
NOTER_PAGE: (51 0.8510230179028133 . 0.2206148282097649)
Neural networks are complex to build and manage for a number of reasons. First, they require extremely large corpuses of clean, well-labeled data
NOTER_PAGE: (52 0.5044757033248082 . 0.22423146473779385)
These features made developing and running neural networks prohibitively expensive until the last fifteen years or so.
NOTER_PAGE: (52 0.6751918158567776 . 0.22694394213381555)
NOTER_PAGE: (53 . 0.373738)
Word2Vec doesn’t work well on long ranges of text that require understanding words in context of each other.
NOTER_PAGE: (53 0.5294117647058824 . 0.3236889692585895)
Word2Vec can’t handle out-of-vocabulary words — words that the model has not been trained on and needs to generalize to.
NOTER_PAGE: (53 0.631074168797954 . 0.3625678119349005)
Word2Vec encounters context collapse around polysemy — the coexistence of many possible meanings for the same phrase:
NOTER_PAGE: (53 0.7295396419437341 . 0.42766726943942135)
A popular variation of an RNN that worked around this problem was the long short-term memory network (LSTM),
NOTER_PAGE: (54 0.5242966751918159 . 0.2857142857142857)
While LSTMs worked fairly well, they had their own limitations. Because they were architecturally complicated, they took much longer to train, and at a higher computational cost, because they couldn’t be trained in parallel.
NOTER_PAGE: (54 0.6227621483375959 . 0.24321880650994573)
Encoders/Decoders and Attention
NOTER_PAGE: (54 . 0.701177)
Two concepts allowed researchers to overcome computationally expensive issues with remembering long vectors for a larger context window than what was available in RNNs and Word2Vec before it: the encoder/decoder architecture, and the attention mechanism.
NOTER_PAGE: (54 0.7372122762148339 . 0.19710669077757684)
The encoder/decoder architecture is a neural network architecture comprised of two neural networks, an encoder that takes the input vectors from our data and creates an embedding of a fixed length, and a decoder, also a neural network, which takes the embeddings encoded as input and generates a static set of outputs such as translated text or a text summary.
NOTER_PAGE: (54 0.8043478260869565 . 0.2206148282097649)
NOTER_PAGE: (55 0.09590792838874682 . 0.3282097649186257)
We can think of attention as a very large, complex hash table that keeps track of the words in the text and how they map to different representations both in the input and the output.
NOTER_PAGE: (55 0.16240409207161127 . 0.352622061482821)
NOTER_PAGE: (56 0.6579283887468031 . 0.2151898734177215)
The goal of a transformer model is to take a piece of multimodal content, and learn the latent relationships by creating multiple views of groups of words in the input corpus (multiple context windows).
NOTER_PAGE: (57 0.31521739130434784 . 0.19168173598553345)
the next-to-last layer is the model’s embeddings, which we can use for downstream work.
NOTER_PAGE: (57 0.432225063938619 . 0.22513562386980107)
these alone will not help us with context, so, on top of this, we also learn positional embeddings
NOTER_PAGE: (57 0.7947570332480819 . 0.19529837251356238)
NOTER_PAGE: (58 0.17647058823529413 . 0.2703435804701627)
For each embedding, we generate a weighted average value based on these learned attention weights.
NOTER_PAGE: (58 0.3478260869565218 . 0.19529837251356238)
What’s great about scaled dot-product attention (and about all of the layers of the encoder) is that the work can be done in parallel
NOTER_PAGE: (58 0.4974424552429668 . 0.2206148282097649)
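Scaled dot-product attention in plain numpy (my sketch of the mechanism described here): similarity scores between queries and keys, softmaxed into weights, then a weighted average of the values.
#+begin_src python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # how well each query matches each key
    weights = softmax(scores)        # attention weights sum to 1 per query
    return weights @ V               # each output is a weighted average of the values

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))  # 4 tokens, dimension 8
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 8)
#+end_src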
BERT
NOTER_PAGE: (59 . 0.503015)
The output of BERT is latent representations of words and their context — a set of embeddings. BERT is, essentially, an enormous parallelized Word2Vec that remembers longer context windows.
NOTER_PAGE: (60 0.24616368286445015 . 0.2206148282097649)
GPT
NOTER_PAGE: (60 . 0.357176)
GPT differs from BERT in that it encodes as well as decodes text from embeddings and therefore can be used for probabilistic inference.
NOTER_PAGE: (60 0.40984654731457804 . 0.5777576853526221)
Embeddings in Production
NOTER_PAGE: (60 . 0.692233)
The model that is deployed is always better and more accurate than the model that is only ever a prototype.
NOTER_PAGE: (61 0.19948849104859337 . 0.2206148282097649)
hard disagree
NOTER_PAGE: (61 0.4008951406649617 . 0.30560578661844484)
NOTER_PAGE: (61 0.5338874680306905 . 0.22332730560578662)
Embeddings in Practice
NOTER_PAGE: (62 . 0.084175)
Pinterest
NOTER_PAGE: (62 . 0.264902)
YouTube and Google Play Store
NOTER_PAGE: (63 . 0.333707)
NOTER_PAGE: (66 . 0.816559)
Embeddings as an Engineering Problem
NOTER_PAGE: (69 . 0.438294)
NOTER_PAGE: (69 0.4769820971867008 . 0.28752260397830015)
they blend data that then needs to be monitored for drift
NOTER_PAGE: (69 0.5121483375959079 . 0.2857142857142857)
non-deterministic in their outputs,
NOTER_PAGE: (69 0.5287723785166241 . 0.39511754068716093)
processing pipeline jungles.
NOTER_PAGE: (69 0.5639386189258313 . 0.5325497287522604)
Embeddings Generation
NOTER_PAGE: (71 . 0.380917)
NOTER_PAGE: (72 0.159846547314578 . 0.2197106690777577)
There are pre-trained BERT embeddings available that we can fine-tune. There are other generalized corpuses available, such as GloVe, Word2Vec, and FastText
NOTER_PAGE: (72 0.3657289002557545 . 0.2396021699819168)
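Pulling pre-trained vectors instead of training from scratch, assuming the gensim package (the corpus name is one of gensim's standard downloads, not from the book):
#+begin_src python
import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-100")  # downloads the vectors on first use
print(glove["bird"][:5])                     # first few dimensions of one word vector
print(glove.most_similar("bird", topn=3))    # nearest neighbors in the embedding space
#+end_src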
Storage and Retrieval
NOTER_PAGE: (72 . 0.512613)
Drift Detection, Versioning, and Interpretability
NOTER_PAGE: (74 . 0.171293)
Inference and Latency
NOTER_PAGE: (75 . 0.306481)
Online and Offline Model Evaluation
NOTER_PAGE: (76 . 0.084175)
What makes embeddings projects successful
NOTER_PAGE: (76 . 0.349394)
Conclusion
NOTER_PAGE: (76 . 0.819141)
Notes that link to this note