Word and Graph Embeddings for Machine Learning

Speaker: Steven Skiena
April 19, 2022 at 14:00; location: Seminar room 4B125 (Copernic building)

Distributed word embeddings (e.g. word2vec) provide a powerful way to reduce large text corpora to concise features (vectors) readily applicable to a variety of problems in NLP and data science. I will introduce word embeddings and apply them in a variety of new and interesting directions, including:

(1) Multilingual NLP: The Polyglot project (www.polyglot-NLP.com) employs deep learning and other techniques to build a basic NLP pipeline (including entity recognition, POS tagging, and sentiment analysis) for over 100 different languages. We train our systems over each language's Wikipedia edition, which provides unified data resources in the absence of explicitly annotated data but poses substantial challenges in interpretation and evaluation.

(2) Detecting Historical Shifts in Word Meaning: Words like "gay" and "mouse" have substantially shifted their meanings over time in response to societal and technological changes. We use word embeddings trained over texts drawn from different time periods to detect changes in word meanings. This is part of our efforts in historical trend analysis.
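
As a rough illustration of comparing embeddings across time periods, the sketch below scores a word by how much its nearest neighbors change between two embedding tables. This is an assumed, simplified measure and not necessarily the method used in the talk; the function names and the tiny hand-made 2-D vectors are hypothetical and exist only to make the example runnable. Comparing neighbor sets avoids having to align two independently trained vector spaces.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def nearest_neighbors(embeddings, word, k):
    """Return the set of the k words most similar to `word`."""
    others = [w for w in embeddings if w != word]
    others.sort(key=lambda w: cosine(embeddings[word], embeddings[w]), reverse=True)
    return set(others[:k])

def neighborhood_shift(old, new, word, k=2):
    """Jaccard distance between the word's neighbor sets in two periods.

    0.0 means identical neighbors, 1.0 means no neighbor in common.
    """
    a = nearest_neighbors(old, word, k)
    b = nearest_neighbors(new, word, k)
    return 1.0 - len(a & b) / len(a | b)

# Hand-made toy vectors (hypothetical, not real trained embeddings).
old_period = {
    "mouse": [1.0, 0.0], "rat": [0.9, 0.1], "cat": [0.8, 0.3],
    "keyboard": [0.0, 1.0], "screen": [0.1, 0.9],
}
new_period = {
    "mouse": [0.1, 1.0], "rat": [1.0, 0.0], "cat": [0.9, 0.2],
    "keyboard": [0.0, 1.0], "screen": [0.2, 0.9],
}

shift = neighborhood_shift(old_period, new_period, "mouse")
```

In this toy data, "mouse" sits next to the animal words in the first table and next to the computer words in the second, so its score is at the maximum of the scale.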

(3) Feature Extraction from Graphs: We present DeepWalk, our widely adopted approach for learning latent representations of vertices in a network. DeepWalk uses local information obtained from truncated random walks to learn embeddings, treating walks as the equivalent of sentences in a language. It is suitable for a broad class of applications such as network classification and anomaly detection. We also introduce new graph embedding techniques based on random projections, which produce DeepWalk-quality embeddings thousands of times faster than previous algorithms.
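
The idea of treating walks as sentences can be sketched as follows. This is a minimal illustration under stated assumptions, not the DeepWalk implementation: the function name, its parameters, and the toy graph are hypothetical, and the final step (feeding the walks to a word-embedding trainer such as word2vec) is left out.

```python
import random

def truncated_random_walks(adjacency, walk_length, walks_per_vertex, seed=0):
    """Generate fixed-length random walks, one list of vertex names per walk.

    `adjacency` maps each vertex to a list of its neighbors. Each walk plays
    the role of a sentence and each vertex the role of a word.
    """
    rng = random.Random(seed)
    vertices = list(adjacency)
    walks = []
    for _ in range(walks_per_vertex):
        rng.shuffle(vertices)  # visit start vertices in a fresh order each pass
        for start in vertices:
            walk = [start]
            while len(walk) < walk_length:
                neighbors = adjacency[walk[-1]]
                if not neighbors:  # dead end: stop this walk early
                    break
                walk.append(rng.choice(neighbors))
            walks.append([str(v) for v in walk])
    return walks

# Hypothetical toy graph: a triangle a-b-c with a pendant vertex d.
toy_graph = {
    "a": ["b", "c"],
    "b": ["a", "c"],
    "c": ["a", "b", "d"],
    "d": ["c"],
}

walks = truncated_random_walks(toy_graph, walk_length=5, walks_per_vertex=3)
```

The resulting list of walks has the same shape as a tokenized text corpus, so it could be passed to any skip-gram style trainer to obtain one vector per vertex.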


5 Boulevard Descartes 77420 Champs-sur-Marne