NVIDIA Nemotron 3 Nano tutorial: Build a research paper summarizer


What is NVIDIA Nemotron 3 Nano?

NVIDIA Nemotron 3 Nano is an open-source language model that handles up to 1 million tokens in a single context window, roughly 750,000 words, or an entire 200-page research paper in one pass. Released on December 15, 2025, it uses 30 billion total parameters but activates only 3.6 billion per token through Mixture-of-Experts routing, keeping inference fast.

The architecture combines Mamba-2 layers for long-sequence processing with attention layers for precise reasoning. A typical research paper is 50,000 to 150,000 tokens, so the 1M token window handles complete papers without splitting them.

The model is available under the NVIDIA Open Model License in three formats: BF16 (60GB VRAM), FP8 (32GB VRAM), and GGUF (24GB+ VRAM). For this tutorial, we’ll use free API access through OpenRouter.

We’ll build a research paper summarizer that accepts PDF uploads, extracts text, and generates structured summaries. Before building, let’s look at the five architectural features that make processing entire papers practical.

Key features of NVIDIA Nemotron 3 Nano architecture

The model’s architecture has five features that make processing entire research papers practical. Let’s look at each one.

  1. Hybrid Mamba-Transformer design

    Standard transformers scale quadratically (O(n²)): 500,000 tokens means on the order of 250 billion attention operations. Nemotron’s 23 Mamba-2 layers scale linearly (O(n)), needing on the order of 500 million. Six attention layers handle precise alignment between distant sections.

  2. Mixture-of-Experts routing

    Each token gets processed by only 6 experts out of 128 available, plus 1 shared expert. This keeps the model smart (30B total parameters) without being slow (3.6B active per token).

  3. 1 million token context window

    Linear scaling makes this practical. The model maintains 87.5% accuracy at 64K tokens, 82.9% at 128K, and 70.6% at 512K. Most research papers fall in the 50K-150K range.

  4. No positional embeddings (NoPE)

    Mamba-2 layers carry sequence information implicitly, so the model handles sequences longer than its training length without the degradation that positional embeddings typically introduce.

  5. FP8 quantization

    Uses half the memory (32GB vs 60GB) while retaining 99% accuracy, making it accessible on more hardware.

These architectural choices enable efficient long-context processing. Here’s what this means in practice.
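To make the scaling gap concrete, here is a back-of-the-envelope sketch. The quadratic figure matches the article’s 250-billion number; the per-token constant `c` for the linear path is an assumption chosen to reproduce the 500-million figure, not a measured value:

```python
# Back-of-the-envelope scaling comparison (illustrative arithmetic only).
# Self-attention cost grows as n^2; a linear-time layer grows as c * n,
# where c is an assumed per-token constant.
def attention_ops(n: int) -> int:
    return n * n

def mamba_ops(n: int, c: int = 1000) -> int:
    return n * c

for n in (50_000, 150_000, 500_000):
    ratio = attention_ops(n) / mamba_ops(n)
    print(f"{n:>7,} tokens -> attention needs ~{ratio:,.0f}x more ops")
```

The gap widens with sequence length, which is why the hybrid design pays off most on book-length inputs.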

How NVIDIA Nemotron 3 Nano handles long context processing

Traditional transformers struggle with long documents because memory requirements grow quadratically. Nemotron 3 Nano uses Mamba-2 layers that maintain a fixed-size state. A 50K token paper and a 500K token paper use roughly the same memory. Processing time scales linearly: 100 pages takes 2-3 minutes, 200 pages takes 4-6 minutes.

This means entire papers go in one API request. The model sees everything at once and connects sections naturally. For our summarizer, we’ll extract text from PDFs, estimate tokens (4 characters per token), and warn at 800K tokens with an option to remove references.

With this understanding of how the model works, let’s connect to it and start building.

Accessing NVIDIA Nemotron 3 Nano through APIs

The easiest way to use Nemotron 3 Nano is through OpenRouter, which offers free access.

Create an account at OpenRouter and generate an API key from the dashboard. The free tier model identifier is nvidia/nemotron-3-nano-30b-a3b:free. Note that the free tier logs prompts and outputs for model improvement.

Store the API key in a .env file:

OPENROUTER_API_KEY=your-api-key

The API uses OpenAI-compatible endpoints at https://openrouter.ai/api/v1/chat/completions. Basic request structure:

import httpx

url = "https://openrouter.ai/api/v1/chat/completions"
headers = {
    "Authorization": f"Bearer {api_key}",
    "Content-Type": "application/json"
}
payload = {
    "model": "nvidia/nemotron-3-nano-30b-a3b:free",
    "messages": [{"role": "user", "content": "Your prompt here"}],
    "stream": True
}
with httpx.stream("POST", url, headers=headers, json=payload) as response:
    for line in response.iter_lines():
        ...  # process each streamed line (full parsing in Step 3)

For the thinking budget feature, add this to the payload:

payload["extra_body"] = {
    "thinking": {
        "enable": True,
        "budget": 2048
    }
}

That’s the basic API setup. Now, let’s build the actual application starting with PDF text extraction.

NVIDIA Nemotron 3 Nano in practice: building a research paper summarizer

Now we’re going to build the actual application. We’ll create a Streamlit app that takes PDF uploads and uses Nemotron 3 Nano to generate summaries. The code is split across three files to keep things organized.

Step 1: Set up the project

Let’s start by creating a new directory and setting up our environment:

mkdir research-summarizer
cd research-summarizer
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate

Now install the packages we need:

pip install streamlit pymupdf httpx python-dotenv

Quick breakdown:

  • streamlit builds the web interface
  • pymupdf extracts text from PDFs
  • httpx handles API requests with streaming
  • python-dotenv loads our API key from environment variables

Before we write any code, let’s set up the API key. Create a file called .env in your project folder:

OPENROUTER_API_KEY=your_api_key_here

Replace your_api_key_here with the actual key from your OpenRouter dashboard.

Now create three empty Python files: pdf_utils.py, summarizer.py, and app.py. We’ll fill them in the next steps.

Step 2: Extract text from PDF files

Open pdf_utils.py. First, let’s add the imports:

import fitz # PyMuPDF
import re
from typing import Tuple, Dict

Now we’ll write the main extraction function. This takes the uploaded file from Streamlit and pulls out all the text:

def extract_text_from_pdf(uploaded_file) -> Tuple[str, Dict]:
    """Extract text from uploaded PDF file."""
    file_content = uploaded_file.read()
    doc = fitz.open(stream=file_content, filetype="pdf")
    text = []
    for page in doc:
        text.append(page.get_text())
    full_text = "\n".join(text)

We read the file content, open it with PyMuPDF, loop through each page grabbing its text, and join everything into one string.

Let’s also grab some metadata from the PDF:

    metadata = {
        "total_pages": len(doc),
        "title": doc.metadata.get("title", ""),
        "author": doc.metadata.get("author", "")
    }
    uploaded_file.seek(0)  # Reset file pointer
    return full_text, metadata

The seek(0) resets the file pointer so Streamlit can reuse the file if needed.

Next, we need to estimate the token count. Remember, we’re aiming to stay under 1 million tokens:

def estimate_token_count(text: str) -> int:
    """Estimate token count (approx 4 chars per token)."""
    if not text:
        return 0
    return len(text) // 4

This is a rough estimate using 4 characters per token. Good enough for checking if we’re approaching the limit.
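To see what the heuristic predicts for a typical upload, here is a standalone sanity check (reproducing the same logic outside the project files):

```python
# Standalone sanity check of the 4-characters-per-token heuristic
# (same logic as estimate_token_count in pdf_utils.py).
def estimate_token_count(text: str) -> int:
    return len(text) // 4 if text else 0

# A paper of about 60,000 characters estimates to ~15,000 tokens
print(estimate_token_count("x" * 60_000))  # 15000
```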

Add a warning function for long papers:

def should_warn_about_length(text: str, limit: int = 800000) -> bool:
    """Check if text is approaching the 1M token limit."""
    return estimate_token_count(text) > limit

We set the threshold at 800K to give some buffer room.

The last function in this file handles removing the references section for very long papers:

def remove_references_section(text: str) -> str:
    """Remove references/bibliography section to save tokens."""
    patterns = [
        r'\n\s*References\s*\n',
        r'\n\s*Bibliography\s*\n',
        r'\n\s*Literature Cited\s*\n'
    ]
    cutoff_idx = -1
    text_len = len(text)
    search_start = int(text_len * 0.8) if text_len > 5000 else 0
    for pattern in patterns:
        matches = list(re.finditer(pattern, text[search_start:], re.IGNORECASE))
        if matches:
            match = matches[-1]
            abs_start = search_start + match.start()
            if abs_start > cutoff_idx:
                cutoff_idx = abs_start
    if cutoff_idx != -1:
        return text[:cutoff_idx]
    return text

This searches the last 20% of the document for common reference headers and cuts everything after the match. A references section can run 10K-30K tokens but adds little to a summary.
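To see the cutoff logic in action, here is the same function exercised on a toy document (the sample text is made up for illustration):

```python
import re

# Same logic as remove_references_section above, demonstrated standalone.
def remove_references_section(text: str) -> str:
    patterns = [r'\n\s*References\s*\n', r'\n\s*Bibliography\s*\n',
                r'\n\s*Literature Cited\s*\n']
    cutoff_idx = -1
    search_start = int(len(text) * 0.8) if len(text) > 5000 else 0
    for pattern in patterns:
        matches = list(re.finditer(pattern, text[search_start:], re.IGNORECASE))
        if matches:
            abs_start = search_start + matches[-1].start()
            cutoff_idx = max(cutoff_idx, abs_start)
    return text[:cutoff_idx] if cutoff_idx != -1 else text

paper = "Introduction...\nResults...\n References \n[1] Smith et al."
trimmed = remove_references_section(paper)
print(trimmed.endswith("Results..."))  # True: the references tail is gone
```

Note that short documents (under 5,000 characters) are searched in full, since the 80% heuristic only makes sense for long papers.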

That’s it for pdf_utils.py. Save the file.

Step 3: Build the API handler

Open the summarizer.py file. We’re building a class that talks to the Nemotron API and streams back responses.

Start with imports and environment setup:

import os
import json
import httpx
from typing import Iterator
from dotenv import load_dotenv
load_dotenv()

Now let’s create the PaperSummarizer class and initialize it:

class PaperSummarizer:
    def __init__(self):
        self.api_key = os.getenv("OPENROUTER_API_KEY")
        self.base_url = "https://openrouter.ai/api/v1/chat/completions"
        self.model = "nvidia/nemotron-3-nano-30b-a3b:free"
        self.headers = {
            "Authorization": f"Bearer {self.api_key}",
            "Content-Type": "application/json",
            "HTTP-Referer": "http://localhost:8501",
            "X-Title": "Research Paper Summarizer"
        }

The headers include a referrer and title; OpenRouter uses these optional headers to attribute requests to your app.

Next, we need prompt templates. Let’s create a helper method that generates different prompts based on the mode:

    def _create_prompt(self, paper_text: str, mode: str) -> str:
        if mode == "quick":
            return f"""You are an expert research assistant. Read the following research paper and provide a concise executive summary.
Research Paper:
{paper_text}
Please provide:
1. A 3-5 paragraph executive summary covering:
- Research question/objective
- Methodology approach
- Key results
- Main conclusions
Keep the summary focused and accessible."""

Quick mode asks for a brief overview. Now add the detailed mode prompt:

        else:  # detailed mode
            return f"""You are an expert research assistant. Read the following research paper and provide a comprehensive analysis.
Research Paper:
{paper_text}
Please provide a structured analysis with the following sections:
1. EXECUTIVE SUMMARY (200-300 words)
2. KEY FINDINGS (5-10 bullet points with supporting evidence)
3. METHODOLOGY OVERVIEW (research design, data sources, techniques)
4. RESULTS SYNTHESIS (quantitative and qualitative findings)
5. LIMITATIONS AND FUTURE WORK
Please reference page numbers or sections when making specific claims."""

Detailed mode requests structured analysis with specific sections. The prompt explicitly asks for section references to keep the model grounded in the actual paper.

Now for the main summarization method. This is where we send the request and stream back the response:

    def summarize_paper(self, paper_text: str, mode: str = "quick",
                        enable_thinking: bool = False) -> Iterator[str]:
        prompt = self._create_prompt(paper_text, mode)
        payload = {
            "model": self.model,
            "messages": [{"role": "user", "content": prompt}],
            "temperature": 0.5 if enable_thinking else 0.2,
            "stream": True,
            "max_tokens": 4096
        }

Temperature controls randomness: 0.2 keeps quick summaries tightly focused, while 0.5 gives slightly more latitude in detailed mode (the only mode where thinking is enabled).

If thinking is enabled, we add the thinking budget:

        if enable_thinking:
            payload["extra_body"] = {
                "thinking": {"enable": True, "budget": 2048}
            }

This tells the model to spend up to 2048 tokens on internal reasoning before generating the final answer.

Now let’s send the request and handle the streaming response:

        try:
            with httpx.stream("POST", self.base_url, headers=self.headers,
                              json=payload, timeout=60.0) as response:
                if response.status_code != 200:
                    yield f"Error: API returned status {response.status_code}"
                    return
                for line in response.iter_lines():
                    if line.startswith("data: "):
                        data_str = line[6:]
                        if data_str.strip() == "[DONE]":
                            break
                        try:
                            data = json.loads(data_str)
                            content = data.get("choices", [{}])[0].get("delta", {}).get("content", "")
                            if content:
                                yield content
                        except json.JSONDecodeError:
                            continue
        except Exception as e:
            yield f"\n\nConnection Error: {str(e)}"

We’re using httpx.stream() to keep the connection open. Each line from the API starts with data: and contains a JSON chunk. We parse these chunks and yield the content piece by piece. This is what enables the real-time text display in Streamlit.
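To illustrate the chunk format, here is a standalone parser for a single SSE line of the kind the endpoint streams back. The sample payload below is hand-made to mimic the OpenAI-compatible streaming shape, not a recorded response:

```python
import json

def parse_sse_line(line: str) -> str:
    """Extract the content delta from one 'data: {...}' SSE line."""
    if not line.startswith("data: "):
        return ""
    data_str = line[6:]
    if data_str.strip() == "[DONE]":
        return ""  # stream terminator carries no content
    try:
        data = json.loads(data_str)
        return data.get("choices", [{}])[0].get("delta", {}).get("content", "")
    except json.JSONDecodeError:
        return ""  # skip malformed or partial chunks

# Hand-made sample chunk mimicking the streaming response shape
sample = 'data: {"choices": [{"delta": {"content": "The paper"}}]}'
print(parse_sse_line(sample))          # The paper
print(parse_sse_line("data: [DONE]"))  # (empty string)
```

Swallowing `JSONDecodeError` matters in practice: a network read can split a JSON chunk across two lines, and dropping the fragment is safer than crashing the stream.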

Finally, add a validation function outside the class:

def validate_api_key() -> bool:
    api_key = os.getenv("OPENROUTER_API_KEY")
    return api_key is not None and api_key != "your_api_key_here" and len(api_key) > 10

This checks that the API key exists and isn’t a placeholder.

Save the summarizer.py file. We’re ready for the interface.

Step 4: Create the Streamlit interface

Open the app.py file. This is where everything comes together into a web interface.

Start with imports and page configuration:

import streamlit as st
import time
from pdf_utils import (
    extract_text_from_pdf,
    estimate_token_count,
    should_warn_about_length,
    remove_references_section
)
from summarizer import PaperSummarizer, validate_api_key

st.set_page_config(
    page_title="Research Paper Summarizer - NVIDIA Nemotron 3 Nano",
    page_icon="📄",
    layout="wide"
)

Now let’s build the main function. First, the header and API key check:

def main():
    st.title("📄 Research Paper Summarizer")
    st.markdown("### Powered by NVIDIA Nemotron 3 Nano")
    st.markdown("""
    Upload a research paper (PDF) and get AI-generated summaries using NVIDIA's Nemotron 3 Nano model.
    The model's 1M token context window allows processing entire papers without chunking.
    """)
    if not validate_api_key():
        st.error("⚠️ API key not configured!")
        st.markdown("""
        **Setup Instructions:**
        1. Create a free account at [OpenRouter](https://openrouter.ai)
        2. Generate an API key from your dashboard
        3. Create a `.env` file in the project directory
        4. Add: `OPENROUTER_API_KEY=your_api_key_here`
        5. Restart the application
        """)
        st.stop()

If there’s no API key, we show setup instructions and stop. No point going further without it.

Let’s add the sidebar with configuration options:

    with st.sidebar:
        st.header("⚙️ Configuration")
        summary_mode = st.radio(
            "Summary Mode",
            ["Quick Overview", "Detailed Analysis"],
            help="Quick: 3-5 paragraphs (2-3 min) | Detailed: Comprehensive breakdown (8-12 min)"
        )
        mode = "quick" if summary_mode == "Quick Overview" else "detailed"
        enable_thinking = (summary_mode == "Detailed Analysis")
        st.divider()
        remove_refs = st.checkbox(
            "Remove references section",
            value=False,
            help="Remove bibliography to save tokens (10K-30K tokens)"
        )

The sidebar gives users control over summary mode and optional reference removal. We set enable_thinking to True only for detailed mode.

Add model info at the bottom of the sidebar:

        st.divider()
        st.markdown("### 📊 Model Info")
        st.markdown("""
        - **Model**: NVIDIA Nemotron 3 Nano
        - **Context**: 1M tokens
        - **Architecture**: Hybrid Mamba-Transformer MoE
        - **Parameters**: 30B total, 3.6B active
        """)

Now create the main layout with file upload:

    col1, col2 = st.columns([1, 2])
    with col1:
        st.header("Upload Paper")
        uploaded_file = st.file_uploader(
            "Choose a PDF file",
            type=['pdf'],
            help="Upload an academic paper (max 50MB)"
        )
        if uploaded_file:
            st.success(f"✅ Uploaded: {uploaded_file.name}")
            st.info(f"📦 File size: {uploaded_file.size / 1024 / 1024:.2f} MB")
    with col2:
        st.header("Summary Output")
        if uploaded_file is None:
            st.info("👈 Upload a PDF file to get started")
        else:
            if st.button("🚀 Generate Summary", type="primary", use_container_width=True):
                process_paper(uploaded_file, mode, enable_thinking, remove_refs)

Two columns keep things organized. Upload on the left, output on the right. The generate button only appears when a file is uploaded.


Now we need the processing function. This is where all our pieces come together:

def process_paper(uploaded_file, mode: str, enable_thinking: bool, remove_refs: bool):
    status_container = st.empty()
    progress_bar = st.progress(0)
    try:
        status_container.info("📖 Extracting text from PDF...")
        progress_bar.progress(20)
        paper_text, metadata = extract_text_from_pdf(uploaded_file)
        st.success(f"✅ Extracted {metadata['total_pages']} pages")

We start by extracting text. The progress bar shows we’re 20% done.

        status_container.info("🔢 Counting tokens...")
        progress_bar.progress(40)
        if remove_refs:
            paper_text = remove_references_section(paper_text)
            st.info("📚 References section removed")
        estimated_tokens = estimate_token_count(paper_text)
        st.info(f"📊 Estimated tokens: ~{estimated_tokens:,}")
        if should_warn_about_length(paper_text):
            st.warning("⚠️ This paper is quite long (>800K tokens). Consider enabling 'Remove references section'.")

We count tokens and warn if the paper is really long. 40% progress now.

        status_container.info(f"🤖 Generating {mode} summary...")
        progress_bar.progress(60)
        summarizer = PaperSummarizer()
        output_container = st.empty()
        summary_text = ""
        start_time = time.time()
        for chunk in summarizer.summarize_paper(paper_text, mode, enable_thinking):
            summary_text += chunk
            output_container.markdown(summary_text + "▌")
        output_container.markdown(summary_text)

Here we stream chunks from the API and render them with a trailing cursor glyph for a typing effect. Each chunk is appended to summary_text and the display updates in real time.

        elapsed_time = time.time() - start_time
        progress_bar.progress(100)
        status_container.success(f"✅ Summary generated in {elapsed_time:.1f} seconds!")
        st.download_button(
            label="📥 Download Summary",
            data=summary_text,
            file_name=f"summary_{uploaded_file.name.replace('.pdf', '.txt')}",
            mime="text/plain"
        )
    except Exception as e:
        status_container.error(f"❌ Error: {str(e)}")
        progress_bar.progress(0)

if __name__ == "__main__":
    main()

When done, we show completion time and offer a download button. If anything goes wrong, we catch the error and display it.

Save the file. That’s all three files complete.

Step 5: Run the application

Let’s fire it up. In your terminal, run:

streamlit run app.py

Your browser should open to http://localhost:8501. The interface loads with the title and the upload section ready.

NVIDIA Nemotron 3 Nano research paper summarizer Streamlit interface with upload section and configuration sidebar

The main area has two columns. Left side is for PDF upload with file size display. The right side will show the generated summary once processing starts.

Step 6: Test with a research paper

Now let’s test it with an actual paper. For this example, we’ll use the BERT paper - “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”. This is a 16-page paper from Google AI Language, perfect for testing the model’s capabilities.

Download the paper or use one from your local files. Click “Choose a PDF file” and select the paper.

Select “Quick Overview” mode for this first test. This gives us a 3-5 paragraph summary in about 2-3 minutes. Click the “Generate Summary” button.

The progress bar appears and starts filling. First, it extracts text from all 16 pages. Then it counts tokens and displays the estimate. The BERT paper comes in around 12,000-18,000 tokens, well within the model’s capacity.

Watch as the summary appears word by word in real-time. The streaming display shows a cursor that moves as text generates.

NVIDIA Nemotron 3 Nano generating research paper summary in real-time with streaming text display from BERT PDF upload to complete summary output

For testing the detailed mode, select “Detailed Analysis” and upload the same paper again.

Conclusion

NVIDIA Nemotron 3 Nano represents a practical approach to long-context language models through its hybrid Mamba-Transformer architecture. The model’s key capabilities include:

  • 1M token context window with linear scaling, eliminating chunking for most documents
  • Hybrid architecture combining 23 Mamba-2 layers, 6 attention layers, and 23 MoE layers for efficiency
  • 30B total parameters with 3.6B active per token through expert routing, maintaining fast inference
  • Performance of 87.5% accuracy at 64K tokens and 70.6% at 512K tokens for long-context tasks
  • Open-source release under NVIDIA Open Model License with full weights, training data, and recipes

The research paper summarizer demonstrates how the 1M context window handles real documents without chunking or retrieval systems. The model’s linear scaling and efficient routing make long-context processing practical for document analysis, code understanding, and research applications. For deeper understanding of transformer architectures and how models like Nemotron work, Codecademy’s Finetuning Transformer Models course covers the fundamentals.

Frequently asked questions

1. What is the difference between GPT OSS and Nemotron 3 Nano?

GPT-OSS-20B uses standard transformer architecture with 20B parameters and shorter context windows, while Nemotron 3 Nano employs a hybrid Mamba-Transformer with MoE routing (30B total, 3.6B active per token) supporting 1M tokens. Nemotron runs 2.2x faster on identical hardware and scores higher on SWE-Bench (38.8% vs 34%) and LiveCodeBench (68.3% vs 61%).

2. What is the benchmark for Nemotron Nano 3?

The model achieves 38.8% on SWE-Bench Verified, 73.0% on GPQA Diamond, 89.1% on AIME 2025, and 68.3% on LiveCodeBench. Long-context accuracy is 87.5% at 64K tokens, 82.9% at 128K tokens, and 70.6% at 512K tokens.

3. What is the latest model of Nemotron?

Nemotron 3 Nano (30B total, 3.6B active) is currently available as of December 15, 2025. Nemotron 3 Super (100B total, 10B active) and Ultra (500B total, 50B active) are planned for H1 2026, all supporting 1M token contexts.

4. What is NVIDIA Nemotron?

NVIDIA Nemotron is an open-source language model family designed for agentic AI with a hybrid Mamba-Transformer MoE architecture supporting 1M token contexts. Released under NVIDIA Open Model License, the models include full weights, training data, and recipes for commercial use with attribution requirements.

Codecademy Team

