Unilm
Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
aka.ms/GeneralAI
Hiring
We are hiring at all levels (including FTE researchers and interns)! If you are interested in working with us on Foundation Models (aka large-scale pre-trained models) and General AI, including NLP, MT, Speech, Document AI, and Multimodal AI, please send your resume to fuwei@microsoft.com.
Foundation Architecture
TorchScale - A Library of Foundation Architectures (repo)
Fundamental research to develop new architectures for foundation models and AI, focusing on modeling generality and capability, as well as training stability and efficiency.
Stability - DeepNet: scaling Transformers to 1,000 Layers and beyond
Generality - Foundation Transformers (Magneto): towards true general-purpose modeling across tasks and modalities (including language, vision, speech, and multimodal)
Capability - A Length-Extrapolatable Transformer
Efficiency & Transferability - X-MoE: scalable & finetunable sparse Mixture-of-Experts (MoE)
The Revolution of Model Architecture
BitNet: 1-bit Transformers for Large Language Models
RetNet: Retentive Network, a successor to Transformer for Large Language Models
LongNet: Scaling Transformers to 1,000,000,000 Tokens
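As a rough illustration of what "1-bit" means in the BitNet line of work, here is a minimal NumPy sketch of absmean sign-quantization of a weight matrix. This is a toy, not the actual BitNet recipe (which also zero-centers weights and trains with straight-through gradient estimators); every name here is illustrative.

```python
import numpy as np

def quantize_1bit(W):
    """Toy 1-bit weight quantization: approximate W by alpha * sign(W),
    where alpha is the mean absolute value of W (an "absmean" scale).
    Each weight then needs 1 bit of storage plus one shared scale."""
    alpha = float(np.abs(W).mean())
    B = np.where(W >= 0, 1.0, -1.0)  # avoid sign(0) == 0
    return B, alpha

def dequantize(B, alpha):
    """Reconstruct the low-precision approximation of W."""
    return alpha * B

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4))
B, alpha = quantize_1bit(W)
W_hat = dequantize(B, alpha)  # keeps W's sign pattern at 1 bit/weight
```

The approximation keeps only the sign pattern and the average magnitude, which is why matmuls against such weights reduce to additions and subtractions plus a single rescale.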
Foundation Models
The Evolution of (M)LLM (Multimodal LLM)
Kosmos-2.5: A Multimodal Literate Model
Kosmos-2: Grounding Multimodal Large Language Models to the World
Kosmos-1: A Multimodal Large Language Model (MLLM)
MetaLM: Language Models are General-Purpose Interfaces
The Big Convergence - Large-scale self-supervised pre-training across tasks (predictive and generative), languages (100+ languages), and modalities (language, image, audio, layout/format + language, vision + language, audio + language, etc.)
Language & Multilingual
UniLM: unified pre-training for language understanding and generation
InfoXLM/XLM-E: multilingual/cross-lingual pre-trained models for 100+ languages
DeltaLM/mT6: encoder-decoder pre-training for language generation and translation for 100+ languages
MiniLM: small and fast pre-trained models for language understanding and generation
AdaLM: domain, language, and task adaptation of pre-trained models
EdgeLM (NEW): small pre-trained models on edge/client devices
SimLM (NEW): large-scale pre-training for similarity matching
E5 (NEW): text embeddings
MiniLLM (NEW): knowledge distillation of Large Language Models
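MiniLM and MiniLLM are both distillation efforts: a small student model is trained to imitate a large teacher. As a rough illustration of the classic logit-matching objective this line of work starts from (MiniLLM itself minimizes the reverse KL, and MiniLM distills attention distributions, so this is background, not either paper's exact loss), a minimal NumPy sketch:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def kd_loss(student_logits, teacher_logits, T=2.0):
    """Classic knowledge-distillation loss: forward KL(teacher || student)
    on temperature-softened logits, averaged over the batch and rescaled
    by T^2 so gradients stay comparable across temperatures."""
    p = softmax(teacher_logits / T)
    q = softmax(student_logits / T)
    return float((p * (np.log(p) - np.log(q))).sum(axis=-1).mean() * T * T)

teacher = np.array([[2.0, 0.5, -1.0]])
aligned = kd_loss(np.array([[2.0, 0.5, -1.0]]), teacher)  # ~0: student matches
off = kd_loss(np.array([[-1.0, 0.5, 2.0]]), teacher)      # > 0: student disagrees
```

The loss is zero exactly when the student reproduces the teacher's distribution, and grows as the two diverge, which is what drives the student toward the teacher during training.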
Vision
BEiT/BEiT-2: generative self-supervised pre-training for vision / BERT Pre-Training of Image Transformers
DiT: self-supervised pre-training for Document Image Transformers
TextDiffuser/TextDiffuser-2 (NEW): Diffusion Models as Text Painters
Speech
WavLM: speech pre-training for full stack tasks
VALL-E: a neural codec language model for TTS
Multimodal (X + Language)
LayoutLM/LayoutLMv2/LayoutLMv3: multimodal (text + layout/format + image) Document Foundation Model for Document AI (e.g. scanned documents, PDF, etc.)
LayoutXLM: multimodal (text + layout/format + image) Document Foundation Model for multilingual Document AI
MarkupLM: markup language model pre-training for visually-rich document understanding
XDoc: unified pre-training for cross-format document understanding
UniSpeech: unified pre-training combining self-supervised and supervised learning for ASR
UniSpeech-SAT: universal speech representation learning with speaker-aware pre-training
SpeechT5: encoder-decoder pre-training for spoken language processing
SpeechLM: Enhanced Speech Pre-Training with Unpaired Textual Data
VLMo: Unified vision-language pre-training
VL-BEiT (NEW): Generative Vision-Language Pre-training - evolution of BEiT to multimodal
BEiT-3 (NEW): a general-purpose multimodal foundation model, and a major milestone of The Big Convergence of Large-scale Pre-training Across Tasks, Languages, and Modalities.
Toolkits
s2s-ft: sequence-to-sequence fine-tuning toolkit
Aggressive Decoding (NEW): lossless and efficient sequence-to-sequence decoding algorithm
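The idea behind draft-and-verify decoding of this kind can be made concrete with a toy sketch. Everything below (`greedy_batch`, the fixed target, the no-re-draft fallback) is illustrative and not the toolkit's API: a draft of the output (for grammatical error correction, simply the input sentence) is checked against the model's greedy choices in one batched pass, the longest agreeing prefix is accepted wholesale, and decoding continues from the first disagreement, so the final output is identical to plain greedy decoding ("lossless") but reached in fewer model invocations.

```python
def aggressive_decode(greedy_batch, draft, eos="<eos>", max_len=50):
    """Toy draft-and-verify loop. greedy_batch(prefixes) stands in for one
    batched greedy forward pass of a seq2seq model. Returns the decoded
    tokens and the number of model calls used."""
    out, calls = [], 0
    draft = list(draft)
    while len(out) < max_len:
        # one "parallel" pass: greedy prediction after each draft prefix
        prefixes = [out + draft[:i] for i in range(len(draft) + 1)]
        preds = greedy_batch(prefixes)
        calls += 1
        # accept the longest prefix where the draft agrees with the model
        n = 0
        while n < len(draft) and preds[n] == draft[n]:
            n += 1
        out = out + draft[:n] + [preds[n]]  # accepted prefix + correction
        if preds[n] == eos:
            break
        draft = []  # toy simplification: no re-drafting after a mismatch

    return out, calls

TARGET = "the cat sat on the mat <eos>".split()

def greedy_batch(prefixes):
    # Stand-in for a batched greedy model step: for each prefix,
    # "predict" the next token of a fixed target sequence.
    return [TARGET[len(p)] if len(p) < len(TARGET) else "<eos>"
            for p in prefixes]

# A fully correct draft is verified in a single model call.
out, calls = aggressive_decode(greedy_batch, "the cat sat on the mat".split())
# An imperfect draft still yields the exact greedy output, with a few
# extra calls after the first disagreement.
out2, calls2 = aggressive_decode(greedy_batch, "the cat sit on the mat".split())
```

Plain greedy decoding would need one model call per output token (seven here); the draft lets most of those collapse into a single verification pass.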
Applications
TrOCR: transformer-based OCR w/ pre-trained models
LayoutReader: pre-training of text and layout for reading order detection
XLM-T: multilingual NMT w/ pretrained cross-lingual encoders
Links
LLMOps (repo)
General technology for enabling AI capabilities w/ LLMs and MLLMs.
RedStone (repo)
Curating General, Code, Math, and QA Data for Large Language Models.
News
- December, 2024: RedStone was released!
- December, 2023: LongNet and LongViT released
- [Model Release] Dec, 2023: TextDiffuser-2 models, code and demo.
- Sep, 2023: Kosmos-2.5 - a multimodal literate model for machine reading of text-intensive images.
- [Model Release] May, 2023: TextDiffuser models and code.
- [Model Release] March, 2023: BEiT-3 pretrained models and code.
- March, 2023: Kosmos-1 - a Multimodal Large Language Model (MLLM) that can perceive general modalities, learn in context (i.e., few-shot), and follow instructions (i.e., zero-shot).
- January, 2023: VALL-E - a language modeling approach for text-to-speech synthesis (TTS), which achieves state-of-the-art zero-shot TTS performance. See https://aka.ms/valle for demos of our work.
- [Model Release] January, 2023: E5 - Text Embeddings by Weakly-Supervised Contrastive Pre-training.
- November, 2022: TorchScale 0.1.1 was released!
- November, 2022: TrOCR was accepted by AAAI 2023.
- [Model Release] November, 2022: XDoc BASE models for cross-format document understanding.
- [Model Release] September, 2022: TrOCR BASE and LARGE models for Scene Text Recognition (STR).
- [Model Release] September, 2022: BEiT v2 code and pretrained models.
- August, 2022: BEiT-3 - a general-purpose multimodal foundation model, which achieves state-of-the-art transfer performance on both vision and vision-language tasks
- July, 2022: SimLM - Large-scale self-supervised pre-training for similarity matching
- June, 2022: DiT
