Gemini 3.5 Flash accepts inputs of up to one million tokens and can generate responses of up to 64,000 tokens, enough to take in very long texts and produce detailed, complex outputs. The new iteration builds on Gemini 3 Flash, prioritizing incremental improvements to existing infrastructure over a complete architectural redesign. According to model cards released by its developers, Gemini 3.5 Flash is the next in a series of highly capable, natively multimodal reasoning models, evaluated across areas including reasoning, coding, and long-context understanding.
Gemini 3.5 Flash Model Architecture and Inputs
Gemini 3.5 Flash accepts prompts of up to one million tokens and produces outputs of up to 64,000 tokens, according to recently published model cards. That input capacity is large enough for lengthy documents or long conversational histories. Central to the model's development is its direct lineage from Gemini 3 Flash. The documentation consistently emphasizes this dependency, stating, “Gemini 3.5 Flash is based on Gemini 3 Flash.” This is not a complete architectural reimagining, but a deliberate strategy of iterative improvement that reuses existing infrastructure and refines established capabilities.
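As a rough illustration of what these limits mean in practice, the sketch below checks whether a request stays within the quoted input and output limits. The helper name and constants are hypothetical and not part of any Google API; the sketch assumes token counts are already known, that "one million" means exactly 1,000,000, and that the two limits apply independently.

```python
# Limits as quoted in the article. The exact integer values are an assumption:
# the article says "one million" and "64,000" without giving precise counts.
MAX_INPUT_TOKENS = 1_000_000
MAX_OUTPUT_TOKENS = 64_000


def fits_limits(prompt_tokens: int, requested_output_tokens: int) -> bool:
    """Return True if both counts are within the quoted limits.

    Hypothetical helper: it assumes the caller has already counted tokens
    and that the input and output limits are checked independently.
    """
    if prompt_tokens < 0 or requested_output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    return (
        prompt_tokens <= MAX_INPUT_TOKENS
        and requested_output_tokens <= MAX_OUTPUT_TOKENS
    )


if __name__ == "__main__":
    # A 750,000-token prompt with a 32,000-token reply fits.
    print(fits_limits(750_000, 32_000))
    # A 1,200,000-token prompt exceeds the quoted input limit.
    print(fits_limits(1_200_000, 32_000))
```

A real integration would obtain token counts from the provider's own tokenizer rather than assume them.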
The model accepts text strings, images, audio, and video files as input, which extends its potential applications beyond purely textual tasks. For details on the underlying architecture, the model card directs readers to the Gemini 3 Flash documentation.
Evaluation benchmarks show performance gains across multiple domains. In coding, Gemini 3.5 Flash scored 76.2% on Terminal-bench 2.1, up from Gemini 3 Flash’s 58.0%. Agentic capabilities also improved, with a score of 83.6% on the Agentic MCP Atlas benchmark, compared to 62.0% for the earlier model. The model also shows gains in multimodal reasoning, multilingual processing, and long-context understanding, including a 77.3% score on the MRCR v2 long-context benchmark at 128,000 tokens. According to internal evaluations, “Overall, Gemini 3.5 Flash outperforms Gemini 3 Flash across both safety and tone, while keeping unjustified refusals low.” The company continues to refine these evaluations, noting that variation in automated results is expected and that flagged content is reviewed manually to check for egregious or dangerous material.
Coding and Multimodal Benchmark Results
The development of large language models is defined by a push for greater capacity and capability, with developers building systems that can process more information, reason over it, and generate complex outputs. Recent evaluations show models improving at both coding and multimodal tasks, two areas crucial for versatile artificial intelligence. Gemini 3.5 Flash, the latest iteration from Google, enters this competitive field with metrics that show clear improvements over its predecessor. The gains in coding are the largest. On Terminal-bench 2.1, a benchmark for agentic terminal coding, the model scored 76.2%, up from the 58.0% attained by Gemini 3 Flash.
This improvement extends to more complex coding challenges: on SWE-Bench Pro, which assesses diverse agentic coding tasks with a single attempt, Gemini 3.5 Flash scored 55.1% compared to 49.6% for the earlier version. The gains are not isolated to specific tasks, as shown by an 83.6% score on the Agentic MCP Atlas benchmark, which tests multi-step workflows, against Gemini 3 Flash’s 62.0%. Google notes that “Additional benchmarks and details on approach, results and their methodologies can be found at: deepmind.com/models/evals-methodology/gemini-3-5-flash,” providing transparency into the evaluation process.
Beyond coding, Gemini 3.5 Flash shows stronger multimodal reasoning. The model scored 84.2% on the CharXiv benchmark, which requires synthesizing information from complex charts, up from Gemini 3 Flash’s 80.3%. Similarly, on the MMMU-Pro benchmark, which tests multimodal understanding and reasoning, the model reached 83.6%, ahead of the 81.2% achieved by its predecessor.
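The coding and multimodal scores quoted above can be compared directly. The sketch below computes the gain of Gemini 3.5 Flash over Gemini 3 Flash in percentage points, and as a relative change, using only the figures reported in this article. The variable and function names are illustrative assumptions, and the result is not an official aggregate metric.

```python
# (benchmark, Gemini 3.5 Flash %, Gemini 3 Flash %), as quoted in the article.
SCORES = [
    ("Terminal-bench 2.1", 76.2, 58.0),
    ("SWE-Bench Pro", 55.1, 49.6),
    ("MCP Atlas", 83.6, 62.0),
    ("CharXiv", 84.2, 80.3),
    ("MMMU-Pro", 83.6, 81.2),
]


def point_gain(new: float, old: float) -> float:
    """Absolute difference in percentage points, rounded to one decimal."""
    return round(new - old, 1)


def relative_gain(new: float, old: float) -> float:
    """Change relative to the older score, in percent, rounded to one decimal."""
    return round(100.0 * (new - old) / old, 1)


if __name__ == "__main__":
    for name, new, old in SCORES:
        print(f"{name}: +{point_gain(new, old)} points "
              f"({relative_gain(new, old)}% relative)")
```

On these figures the largest point gains are on the agentic benchmarks (21.6 points on MCP Atlas and 18.2 on Terminal-bench 2.1), while the multimodal gains are smaller (3.9 and 2.4 points).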
The ability to handle long-context inputs is also a key differentiator; Gemini 3.5 Flash scored 77.3% on the MRCR v2 benchmark at 128,000 tokens. Taken together, these results suggest that Gemini 3.5 Flash can extract meaning from complex, multi-faceted inputs, which makes it suitable for a range of applications.
“We expect variation in our automated safety evaluations results, which is why we review flagged content to check for egregious or dangerous material.” (Google)
Long-Context Performance on MRCR v2 and Reasoning
DeepMind designed Gemini 3.5 Flash, a new iteration of its large language model, to process extensive information and generate detailed outputs. A key benchmark for this capability is the MRCR v2 test, which evaluates long-context performance. Released results indicate Gemini 3.5 Flash scored 77.3% on MRCR v2 at a 128,000 token length, up from the 67.2% attained by Gemini 3 Flash. At a one million token count it reached 26.6%, well below its 128,000-token result, which suggests that the full context length remains considerably harder. For comparison, Claude Sonnet 4.6 scored 59.3% at 128,000 tokens, and one million token data is unavailable for GPT-5.5.
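To make the comparison concrete, the sketch below tabulates the MRCR v2 figures quoted above and computes the differences between them in percentage points. The data structure and helper names are illustrative assumptions; a missing entry, such as the unavailable GPT-5.5 result, is represented as None and not filled in.

```python
from typing import Optional

# MRCR v2 scores (%) keyed by (model, context length in tokens),
# limited to the figures quoted in the article.
MRCR_V2 = {
    ("Gemini 3.5 Flash", 128_000): 77.3,
    ("Gemini 3 Flash", 128_000): 67.2,
    ("Claude Sonnet 4.6", 128_000): 59.3,
    ("Gemini 3.5 Flash", 1_000_000): 26.6,
}


def score(model: str, context_tokens: int) -> Optional[float]:
    """Return the quoted score, or None when the article gives no figure."""
    return MRCR_V2.get((model, context_tokens))


def gap(a: Optional[float], b: Optional[float]) -> Optional[float]:
    """Difference a - b in percentage points, or None if either is missing."""
    if a is None or b is None:
        return None
    return round(a - b, 1)


if __name__ == "__main__":
    flash_128k = score("Gemini 3.5 Flash", 128_000)
    print("vs Gemini 3 Flash at 128k:",
          gap(flash_128k, score("Gemini 3 Flash", 128_000)))
    print("vs Claude Sonnet 4.6 at 128k:",
          gap(flash_128k, score("Claude Sonnet 4.6", 128_000)))
    print("128k vs 1M for Gemini 3.5 Flash:",
          gap(flash_128k, score("Gemini 3.5 Flash", 1_000_000)))
    print("GPT-5.5 at 1M:", score("GPT-5.5", 1_000_000))
```

On the quoted figures, Gemini 3.5 Flash leads Gemini 3 Flash by 10.1 points at 128,000 tokens, and its own score at one million tokens is 50.7 points lower than at 128,000.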
The 128,000-token result suggests the model manages dense information effectively at that length. Beyond input capacity, Gemini 3.5 Flash can also generate text sequences of up to 64,000 tokens. This extended generation capacity is not only about producing longer responses; it allows for nuanced, detailed content of the kind needed in complex agentic workflows and multi-week enterprise processes. DeepMind’s evaluation methodology, detailed at deepmind.com/models/evals-methodology/gemini-3-5-flash, covers the model’s performance across a range of benchmarks, including coding, multimodal understanding, and multilingual performance. According to the model card documentation, “Gemini 3.5 Flash is well-suited for users, developers, and enterprises, some use cases include: agentic workflows, coding tasks, and multi-week enterprise processes.” The consistent references to Gemini 3 Flash as the base model underscore a deliberate strategy of incremental improvement, one that favors refining existing infrastructure over a complete architectural overhaul and that may offer advantages in stability and resource allocation.
Safety Evaluations: Text, Image, and Tone Improvements
The release of Gemini 3.5 Flash is not solely about expanded capabilities; significant attention has also gone to its safety profile and response characteristics. Safety assessments were conducted through automated systems and manual “red teaming” exercises. Automated content safety evaluations, which measure adherence to established policies, showed a 3.9% decrease in safety performance for text-to-text interactions, but developers note this was largely attributable to refined evaluation methods and not a genuine regression in safety. A similar 2.6% decrease was observed in multilingual safety assessments, again attributed to improved evaluation techniques. The model also showed an 8.9% improvement in objective tone when issuing refusals, indicating a more measured response to sensitive prompts.
The model also showed a 0.8% increase in its ability to respond safely to borderline prompts, reducing unnecessary refusals while upholding safety standards. Gemini 3.5 Flash was measured not only against the previous generation but also against competitors such as Claude Sonnet, Claude Opus, and GPT-5.5. Performance varied across evaluation categories, but the data suggests a competitive safety profile. The team also conducted specialized red teaming exercises, employing external experts to probe the model for vulnerabilities, particularly concerning child safety. On child safety, the developers state: “For child safety evaluations, Gemini 3.5 Flash satisfied required launch thresholds, which were developed by expert teams to protect children online and meet Google’s commitments to child safety across our models and Google products.” Assessments aligned with the company’s Frontier Safety Framework indicate that Gemini 3.5 Flash, like Gemini 3.1 Pro, is unlikely to reach thresholds that would necessitate heightened safety measures. The focus remains on continuous refinement of both automated evaluations and query sets to maintain a high standard of safety and responsible AI behavior.
