Unilm
Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
aka.ms/GeneralAI
Hiring
We are hiring at all levels (including FTE researchers and interns)! If you are interested in working with us on Foundation Models (aka large-scale pre-trained models) and General AI, including NLP, MT, Speech, Document AI, and Multimodal AI, please send your resume to fuwei@microsoft.com.
Foundation Architecture
TorchScale - A Library of Foundation Architectures (repo)
Fundamental research to develop new architectures for foundation models and AI, focusing on modeling generality and capability, as well as training stability and efficiency.
Stability - DeepNet: scaling Transformers to 1,000 Layers and beyond
Generality - Foundation Transformers (Magneto): towards true general-purpose modeling across tasks and modalities (including language, vision, speech, and multimodal)
Capability - A Length-Extrapolatable Transformer
Efficiency & Transferability - X-MoE: scalable & finetunable sparse Mixture-of-Experts (MoE)
The Revolution of Model Architecture
BitNet: 1-bit Transformers for Large Language Models
RetNet: Retentive Network, a successor to Transformer for Large Language Models
LongNet: Scaling Transformers to 1,000,000,000 Tokens
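As a rough illustration of what "1-bit" means in the BitNet line of work, here is a minimal NumPy sketch of absmean sign-quantization of a weight matrix. This is a toy, not the actual BitNet recipe (which also zero-centers weights and trains with straight-through gradient estimators); every name here is illustrative.

```python
import numpy as np

def quantize_1bit(W):
    """Toy 1-bit weight quantization: approximate W by alpha * sign(W),
    where alpha is the mean absolute value of W (an "absmean" scale).
    Each weight then needs 1 bit of storage plus one shared scale."""
    alpha = float(np.abs(W).mean())
    B = np.where(W >= 0, 1.0, -1.0)  # avoid sign(0) == 0
    return B, alpha

def dequantize(B, alpha):
    """Reconstruct the low-precision approximation of W."""
    return alpha * B

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4))
B, alpha = quantize_1bit(W)
W_hat = dequantize(B, alpha)  # keeps W's sign pattern at 1 bit/weight
```

The approximation keeps only the sign pattern and the average magnitude, which is why matmuls against such weights reduce to additions and subtractions plus a single rescale.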
Foundation Models
The Evolution of (M)LLM (Multimodal LLM)
Kosmos-2.5: A Multimodal Literate Model
Kosmos-2: Grounding Multimodal Large Language Models to the World
Kosmos-1: A Multimodal Large Language Model (MLLM)
MetaLM: Language Models are General-Purpose Interfaces
The Big Convergence - Large-scale self-supervised pre-training across tasks (predictive and generative), languages (100+ languages), and modalities (language, image, audio, layout/format + language, vision + language, audio + language, etc.)
Language & Multilingual
UniLM: unified pre-training for language understanding and generation
InfoXLM/XLM-E: multilingual/cross-lingual pre-trained models for 100+ languages
DeltaLM/mT6: encoder-decoder pre-training for language generation and translation for 100+ languages
MiniLM: small and fast pre-trained models for language understanding and generation
AdaLM: domain, language, and task adaptation of pre-trained models
EdgeLM (NEW): small pre-trained models on edge/client devices
SimLM (NEW): large-scale pre-training for similarity matching
E5 (NEW): text embeddings
MiniLLM (NEW): knowledge distillation of Large Language Models
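MiniLM and MiniLLM are both distillation efforts: a small student model is trained to imitate a large teacher. As a rough illustration of the classic logit-matching objective this line of work starts from (MiniLLM itself minimizes the reverse KL, and MiniLM distills attention distributions, so this is background, not either paper's exact loss), a minimal NumPy sketch:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def kd_loss(student_logits, teacher_logits, T=2.0):
    """Classic knowledge-distillation loss: forward KL(teacher || student)
    on temperature-softened logits, averaged over the batch and rescaled
    by T^2 so gradients stay comparable across temperatures."""
    p = softmax(teacher_logits / T)
    q = softmax(student_logits / T)
    return float((p * (np.log(p) - np.log(q))).sum(axis=-1).mean() * T * T)

teacher = np.array([[2.0, 0.5, -1.0]])
aligned = kd_loss(np.array([[2.0, 0.5, -1.0]]), teacher)  # ~0: student matches
off = kd_loss(np.array([[-1.0, 0.5, 2.0]]), teacher)      # > 0: student disagrees
```

The loss is zero exactly when the student reproduces the teacher's distribution, and grows as the two diverge, which is what drives the student toward the teacher during training.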
Vision
BEiT/BEiT-2: generative self-supervised pre-training for vision / BERT Pre-Training of Image Transformers
DiT: self-supervised pre-training for Document Image Transformers
TextDiffuser/TextDiffuser-2 (NEW): Diffusion Models as Text Painters
Speech
WavLM: speech pre-training for full stack tasks
VALL-E: a neural codec language model for TTS
Multimodal (X + Language)
LayoutLM/LayoutLMv2/LayoutLMv3: multimodal (text + layout/format + image) Document Foundation Model for Document AI (e.g. scanned documents, PDF, etc.)
LayoutXLM: multimodal (text + layout/format + image) Document Foundation Model for multilingual Document AI
MarkupLM: markup language model pre-training for visually-rich document understanding
XDoc: unified pre-training for cross-format document understanding
UniSpeech: unified pre-training combining self-supervised and supervised learning for ASR
UniSpeech-SAT: universal speech representation learning with speaker-aware pre-training
SpeechT5: encoder-decoder pre-training for spoken language processing
SpeechLM: Enhanced Speech Pre-Training with Unpaired Textual Data
VLMo: Unified vision-language pre-training
VL-BEiT (NEW): Generative Vision-Language Pre-training - evolution of BEiT to multimodal
BEiT-3 (NEW): a general-purpose multimodal foundation model, and a major milestone of The Big Convergence of Large-scale Pre-training Across Tasks, Languages, and Modalities.
Toolkits
s2s-ft: sequence-to-sequence fine-tuning toolkit
Aggressive Decoding (NEW): lossless and efficient sequence-to-sequence decoding algorithm
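The idea behind draft-and-verify decoding of this kind can be made concrete with a toy sketch. Everything below (`greedy_batch`, the fixed target, the no-re-draft fallback) is illustrative and not the toolkit's API: a draft of the output (for grammatical error correction, simply the input sentence) is checked against the model's greedy choices in one batched pass, the longest agreeing prefix is accepted wholesale, and decoding continues from the first disagreement, so the final output is identical to plain greedy decoding ("lossless") but reached in fewer model invocations.

```python
def aggressive_decode(greedy_batch, draft, eos="<eos>", max_len=50):
    """Toy draft-and-verify loop. greedy_batch(prefixes) stands in for one
    batched greedy forward pass of a seq2seq model. Returns the decoded
    tokens and the number of model calls used."""
    out, calls = [], 0
    draft = list(draft)
    while len(out) < max_len:
        # one "parallel" pass: greedy prediction after each draft prefix
        prefixes = [out + draft[:i] for i in range(len(draft) + 1)]
        preds = greedy_batch(prefixes)
        calls += 1
        # accept the longest prefix where the draft agrees with the model
        n = 0
        while n < len(draft) and preds[n] == draft[n]:
            n += 1
        out = out + draft[:n] + [preds[n]]  # accepted prefix + correction
        if preds[n] == eos:
            break
        draft = []  # toy simplification: no re-drafting after a mismatch

    return out, calls

TARGET = "the cat sat on the mat <eos>".split()

def greedy_batch(prefixes):
    # Stand-in for a batched greedy model step: for each prefix,
    # "predict" the next token of a fixed target sequence.
    return [TARGET[len(p)] if len(p) < len(TARGET) else "<eos>"
            for p in prefixes]

# A fully correct draft is verified in a single model call.
out, calls = aggressive_decode(greedy_batch, "the cat sat on the mat".split())
# An imperfect draft still yields the exact greedy output, with a few
# extra calls after the first disagreement.
out2, calls2 = aggressive_decode(greedy_batch, "the cat sit on the mat".split())
```

Plain greedy decoding would need one model call per output token (seven here); the draft lets most of those collapse into a single verification pass.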
Applications
TrOCR: transformer-based OCR w/ pre-trained models
LayoutReader: pre-training of text and layout for reading order detection
XLM-T: multilingual NMT w/ pretrained cross-lingual encoders
Links
LLMOps (repo)
General technology for enabling AI capabilities w/ LLMs and MLLMs.
RedStone (repo)
Curating General, Code, Math, and QA Data for Large Language Models.
News
- December, 2024: RedStone was released!
- December, 2023: LongNet and LongViT released
- [Model Release] Dec, 2023: TextDiffuser-2 models, code and demo.
- Sep, 2023: Kosmos-2.5 - a multimodal literate model for machine reading of text-intensive images.
- [Model Release] May, 2023: TextDiffuser models and code.
- [Model Release] March, 2023: BEiT-3 pretrained models and code.
- March, 2023: Kosmos-1 - a Multimodal Large Language Model (MLLM) that can perceive general modalities, learn in context (i.e., few-shot), and follow instructions (i.e., zero-shot).
- January, 2023: VALL-E - a language modeling approach for text-to-speech synthesis (TTS), which achieves state-of-the-art zero-shot TTS performance. See https://aka.ms/valle for demos of our work.
- [Model Release] January, 2023: E5 - Text Embeddings by Weakly-Supervised Contrastive Pre-training.
- November, 2022: TorchScale 0.1.1 was released!
- November, 2022: TrOCR was accepted by AAAI 2023.
- [Model Release] November, 2022: XDoc BASE models for cross-format document understanding.
- [Model Release] September, 2022: TrOCR BASE and LARGE models for Scene Text Recognition (STR).
- [Model Release] September, 2022: BEiT v2 code and pretrained models.
- August, 2022: BEiT-3 - a general-purpose multimodal foundation model, which achieves state-of-the-art transfer performance on both vision and vision-language tasks
- July, 2022: SimLM - Large-scale self-supervised pre-training for similarity matching
- June, 2022: DiT
