FMEA-MSR: The AIAG-VDA Supplemental Method for In-Use Monitoring and System Response

The PFMEA passed customer review. The DFMEA passed customer review. Then, six months into production of the 48V mild-hybrid battery management system, a fleet of vehicles started throwing intermittent isolation faults that the design-stage detection ratings never anticipated. The failure mode existed. The cause existed. But the analysis stopped at the manufacturing gate, and FMEA-MSR—the AIAG-VDA supplemental methodology for in-use monitoring and system response—was the missing piece that should have modeled what the diagnostic loop did once the customer drove off the lot.

FMEA-MSR was added as a supplemental methodology in the 2019 AIAG-VDA FMEA Handbook to close exactly this gap. It analyzes how mechatronic systems detect and respond to faults after the product ships, not whether the manufacturing process can catch defects before delivery. This guide walks through what FMEA-MSR covers, how its risk evaluation differs from DFMEA, and when adding one to your design analysis is worth the effort.

What FMEA-MSR Actually Covers

FMEA-MSR is a supplemental analysis layered on top of a DFMEA for systems with active or passive fault monitoring during customer operation. The target systems are mechatronic: at minimum a sensor, a control unit, and an actuator—or a subset of those—with diagnostic logic that observes the system’s health while the product is in use. Examples include a battery management system monitoring cell voltage and isolation resistance, an electronic power steering system watching torque sensor plausibility, or an exhaust aftertreatment system tracking NOx sensor response time.

The analysis answers a different question than DFMEA. A DFMEA asks: can the design fail, and if so, can design verification, validation, or end-of-line testing detect that failure before the customer receives the product? FMEA-MSR asks: once the system is in the customer’s hands and a fault occurs, can the integrated monitoring detect it in time for the system to respond in a way that keeps the customer safe or maintains regulatory compliance?

Tip: FMEA-MSR is not a replacement for DFMEA; it is a supplement. The 7-step process runs in parallel, sharing structure and function analysis with the underlying DFMEA, then diverging in failure analysis and risk evaluation to focus on diagnostic coverage and system response.

The DFMEA-MSR Boundary: Why Detection Cannot Cover Field Operation

Standard DFMEA Detection ratings evaluate controls that act before the product reaches the customer: design verification testing, end-of-line functional checks, prototype validation. They do not credit on-board diagnostics that run while the vehicle is in the customer’s driveway. That distinction matters because for safety-critical or regulatory-critical failure modes, the question is not whether validation testing caught the design weakness—it is whether the system itself can recognize a fault has occurred and bring the vehicle to a safe state.

Practitioners doing functional safety work under ISO 26262 already know this gap. The hardware FMEDA (Failure Modes, Effects, and Diagnostic Analysis) quantifies diagnostic coverage for random hardware faults. FMEA-MSR provides a parallel qualitative framework for systems where diagnostic coverage matters but a full FMEDA is either overkill or unsuitable: non-ASIL applications, emissions monitoring, charging system integrity, accessibility-impacting failures, and similar in-use risk domains. For the safety-critical software side, our guide to software FMEA under ISO 26262 walks through the parallel analysis flow.

Frequency Potential (F) Replaces Occurrence

In FMEA-MSR, the Occurrence rating is replaced with Frequency Potential (F)—a 1–10 rating that estimates how often the failure cause is expected to occur over the operational lifetime of a single instance of the product. This is a different framing than DFMEA Occurrence, which estimates causes-per-population.

The reasoning: when a hybrid powertrain inverter develops an IGBT short over a 10-year operational life, the question for monitoring design is not “how many vehicles in the fleet will experience this” but “how often, on average, will the monitoring system need to act when this failure mode occurs in a given vehicle.” The handbook’s F-rating examples align to operating-life frequency bands: F=1 means “practically never within the operational life,” F=5 lands around “once during the operational life,” and F=10 means “frequent, multiple times during the operational life.”

Practitioners coming from traditional Occurrence ratings often anchor too low here. A failure mode that would rate Occurrence 3 in a DFMEA (because it affects a small fraction of the population) may rate F=6 or higher in MSR if, when it occurs in an affected unit, it recurs intermittently across the vehicle’s life. The F rating is per-unit, not per-population.
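The per-population versus per-unit distinction can be made concrete with a small calculation. This is a minimal sketch with invented fleet numbers; the function names and figures are hypothetical, and the mapping from these values onto an actual F or Occurrence rating still has to come from the handbook tables.

```python
def population_incidence(affected_units: int, fleet_size: int) -> float:
    """Fraction of the fleet that ever sees the cause (the Occurrence-style view)."""
    return affected_units / fleet_size


def per_unit_recurrence(total_events: int, affected_units: int) -> float:
    """Average number of events per affected unit over its life (the F-style view)."""
    return total_events / affected_units


# Hypothetical field data, invented for illustration only.
fleet_size = 100_000
affected_units = 500      # vehicles that ever show the intermittent fault
total_events = 10_000     # fault events logged across those vehicles

incidence = population_incidence(affected_units, fleet_size)
recurrence = per_unit_recurrence(total_events, affected_units)

print(f"Share of fleet affected: {incidence:.1%}")
print(f"Events per affected vehicle over life: {recurrence:.0f}")
```

With these made-up numbers, half a percent of the fleet is affected, which looks rare from a population view, yet each affected vehicle sees the fault about twenty times. The second figure is the one the monitoring design has to cope with.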

Monitoring Criteria (M) Replaces Detection

The Detection rating is replaced with Monitoring Criteria (M), also rated 1–10, evaluating two factors together: the diagnostic coverage of the monitoring (can it sense the failure mode?) and the responsiveness of the system reaction (can it act fast enough to keep the customer safe or maintain regulatory state?). Lower is better, as with Detection.

The rating bands run in the same direction as Detection: low is good, high is bad. M=1 is near-perfect: the failure is detected within the fault-tolerant time interval, the response is automatic, and the result is a safe state with no customer harm or regulatory exceedance. M=10 is no monitoring: the failure occurs and the system has no way to recognize it, no way to respond, and no warning to the operator.

Between those endpoints, the bands look at two axes practitioners often conflate:

Sense-time: how quickly diagnostic logic recognizes the failure has happened. A continuous plausibility check on a sensor pair detects within one task cycle; a periodic self-test detects on the next scheduled run, which may be seconds to minutes later.
Response adequacy: what the system does once the fault is recognized—automatic limp-home, operator warning, controlled shutdown, or nothing more than a logged DTC. Automatic safe-state response rates lower (better) than operator-dependent warning; warning-only rates higher (worse).

Common Mistake: Rating M on diagnostic coverage alone while ignoring system response time. A monitor that detects the fault in 50 ms does not help if the actuator chain needs another 800 ms to bring the system to a safe state and the fault-tolerant time interval is 500 ms. The M rating must account for the full sense-to-act loop.
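The sense-to-act check reduces to a simple timing budget. The following is a minimal sketch, assuming all times are in milliseconds and that detection latency and response time simply add; the class and field names are hypothetical, and a real analysis would also account for debounce, confirmation counters, and communication delays.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SenseToActLoop:
    """Timing budget for one monitored fault path, all values in milliseconds."""
    detection_latency_ms: float  # fault occurrence -> monitor recognizes it
    response_time_ms: float      # monitor trips -> system reaches safe state
    ftti_ms: float               # fault-tolerant time interval

    @property
    def total_ms(self) -> float:
        # Assumption: the two stages run strictly in sequence.
        return self.detection_latency_ms + self.response_time_ms

    @property
    def margin_ms(self) -> float:
        # Positive margin means the safe state is reached inside the FTTI.
        return self.ftti_ms - self.total_ms

    def meets_ftti(self) -> bool:
        return self.total_ms <= self.ftti_ms


# The numbers from the example above: fast detection, slow actuator chain.
loop = SenseToActLoop(detection_latency_ms=50, response_time_ms=800, ftti_ms=500)
print(f"loop={loop.total_ms} ms, margin={loop.margin_ms} ms, ok={loop.meets_ftti()}")
```

Here the loop takes 850 ms against a 500 ms budget, so `meets_ftti()` returns `False` even though detection itself is fast. A failed check like this argues for a worse M rating regardless of how good the diagnostic coverage looks on paper.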

The 7 Steps Applied to MSR

FMEA-MSR follows the same 7-step structure as DFMEA and PFMEA: Planning and Preparation, Structure Analysis, Function Analysis, Failure Analysis, Risk Analysis, Optimization, and Results Documentation. The differences are concentrated in steps 4 through 6:

Step 4 (Failure Analysis): for each failure mode and cause already identified in the parent DFMEA, you map the diagnostic monitor that watches for the cause and the system response that fires when the monitor trips. The output is a failure chain extended from cause → failure mode → effect to include monitor → system response.
Step 5 (Risk Analysis): rate Severity (carried from DFMEA), Frequency Potential (F), and Monitoring Criteria (M). The Action Priority lookup is structured differently from the DFMEA AP table—Severity remains the dominant factor, but the F×M combinations are weighted toward regulatory and safety outcomes rather than warranty exposure.
Step 6 (Optimization): actions target either improving F (reducing how often the cause occurs through design hardening) or improving M (adding monitoring coverage, tightening sense-to-act time, escalating response severity). Most practitioners end up acting on M because hardening for cause frequency typically belongs in the parent DFMEA.
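The extended failure chain from Step 4, together with the three ratings from Step 5, can be sketched as a small record type. This is an illustrative data structure only: the class, field names, and example values are hypothetical, and it does not reproduce the handbook's form sheet or its Action Priority table.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MsrFailureChain:
    """One MSR row: cause -> failure mode -> effect, extended with monitor -> response."""
    cause: str
    failure_mode: str
    effect: str
    monitor: Optional[str] = None          # diagnostic that watches for the cause
    system_response: Optional[str] = None  # what fires when the monitor trips
    severity: int = 1    # S, carried over from the parent DFMEA
    frequency: int = 1   # F, Frequency Potential
    monitoring: int = 10 # M, defaults to 10 (no monitoring)

    def __post_init__(self) -> None:
        for name in ("severity", "frequency", "monitoring"):
            value = getattr(self, name)
            if not 1 <= value <= 10:
                raise ValueError(f"{name} must be in 1..10, got {value}")

    def is_monitored(self) -> bool:
        # A chain counts as monitored only if both halves of the loop exist.
        return self.monitor is not None and self.system_response is not None


# Hypothetical BMS example; ratings are placeholders, not handbook values.
chain = MsrFailureChain(
    cause="Moisture ingress at HV connector",
    failure_mode="Isolation resistance below threshold",
    effect="Loss of HV isolation",
    monitor="Periodic isolation resistance measurement",
    system_response="Open main contactors and warn driver",
    severity=10,
    frequency=4,
    monitoring=3,
)
print(chain.is_monitored())
```

Keeping the monitor and the system response as separate fields makes gaps visible: a row with a monitor but no defined response is not yet a complete chain.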

When You Need an MSR and When You Don’t

Not every DFMEA warrants an MSR supplement. The handbook is explicit on this: if the design has no active or passive monitoring components, there is nothing for MSR to analyze. The decision criteria teams typically apply:

Functional safety scope (ISO 26262, IEC 61508): any system in scope with safety mechanisms intended to fire during operation. MSR formalizes the qualitative side that pairs with the FMEDA quantitative coverage.
Emissions and regulatory monitoring: OBD-II / Euro 6/7 / EPA Part 86 mandate specific in-use diagnostic coverage. MSR is the typical artifact to demonstrate the analysis behind the design.
Critical battery, propulsion, and charging systems: high-voltage BMS, traction inverter, on-board charger—all have integrated diagnostics whose effectiveness drives both safety and warranty.
Driver assistance and chassis control: EPS, braking, ADAS—systems where the monitoring loop determines whether a sensor or actuator fault becomes a customer-visible loss of control.

Conversely, a purely mechanical assembly with no electronic supervision, a passive harness, or a low-criticality cosmetic component has no MSR work to do. Running an MSR on everything carries the more dangerous cost: the MSR for the safety-critical system becomes a check-the-box exercise instead of focused engineering.
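The screening logic above can be sketched as a simple filter. This is one possible reading, not a rule from the handbook: the type, its fields, and the decision to require both in-use monitoring and a safety or regulatory concern are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DesignItem:
    name: str
    has_in_use_monitoring: bool        # any active or passive monitoring in operation
    safety_relevant: bool = False      # e.g. in functional safety scope
    regulatory_relevant: bool = False  # e.g. emissions or OBD requirements apply


def is_msr_candidate(item: DesignItem) -> bool:
    """Assumed screening rule: no monitoring means nothing for MSR to analyze."""
    if not item.has_in_use_monitoring:
        return False
    return item.safety_relevant or item.regulatory_relevant


items = [
    DesignItem("HV battery management system", True, safety_relevant=True),
    DesignItem("NOx aftertreatment", True, regulatory_relevant=True),
    DesignItem("Passive wiring harness", False),
    DesignItem("Interior trim panel", False),
]
candidates = [item.name for item in items if is_msr_candidate(item)]
print(candidates)
```

A team would tune the criteria to its own program; the point of the sketch is only that the scope decision is made explicitly per item rather than applied to every DFMEA by default.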

Where the AP Threshold Lands for MSR

The AIAG-VDA Action Priority lookup for MSR uses Severity, Frequency Potential, and Monitoring Criteria as inputs, but the H/M/L assignment is biased differently than the DFMEA AP table. Severity 9–10 failure modes go High AP across nearly the entire F×M space—the methodology explicitly refuses to deprioritize safety-critical failures even when monitoring is excellent, because monitoring credit alone is not allowed to substitute for prevention of the underlying cause.

Practitioners new to MSR sometimes try to use the same RPN-to-AP intuitions they built on regular FMEA. That gets you in trouble fast. For a fuller treatment of the underlying logic, our walkthrough on why high RPN is not always the biggest risk covers the same trap from the DFMEA side. If you need to look up AP combinations interactively, the RPN and Action Priority calculator covers both DFMEA AP and the MSR variant.

Common Implementation Mistakes

Common Mistake: Documenting the MSR as a separate disconnected spreadsheet. The MSR is supplemental to the parent DFMEA and must reference the same structure, function, and failure mode entries; auditors will trace back to the DFMEA and flag inconsistencies. Best practice is to handle MSR as a column extension on the DFMEA when the tooling supports it, or as a linked supplement when it does not.

Treating the DFMEA Detection rating as if it covered in-use monitoring. It does not. Detection in DFMEA is about validation and end-of-line tests pre-customer; MSR is about runtime diagnostic coverage. Re-rating Detection lower because the system has on-board diagnostics is methodologically wrong and audit-visible.
Rating M based on the strongest monitor while ignoring weaker fault paths. If two causes lead to the same failure mode but only one is covered by a monitor, M must reflect the uncovered path. Neither the best-monitor rating nor an average across monitors captures partial coverage.
Skipping the system response time analysis. The fault-tolerant time interval (FTTI) is the budget you have between fault occurrence and the system reaching a safe state. If the diagnostic latency plus actuator response exceeds the FTTI, the monitor cannot save you. The M rating has to incorporate this.
Carrying over DFMEA Occurrence ratings directly. F is per-unit per-life, not per-population. Doing the conversion in your head during the rating session is error-prone; giving the team explicit translation guidance before the session reduces rating mismatches.
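The partial-coverage mistake is easy to show numerically. The sketch below takes a conservative reading in which the worst (highest) per-cause M governs the failure mode; the function name, cause names, and ratings are hypothetical, and teams that rate each cause on its own row would apply the same idea row by row.

```python
from statistics import mean
from typing import Mapping


def governing_m(cause_m: Mapping[str, int]) -> int:
    """Return the worst (highest) M across all causes of one failure mode."""
    if not cause_m:
        raise ValueError("at least one cause is required")
    for cause, m in cause_m.items():
        if not 1 <= m <= 10:
            raise ValueError(f"M for {cause!r} must be in 1..10, got {m}")
    return max(cause_m.values())


# Two causes of the same isolation fault; only the first has a monitor.
isolation_fault_causes = {
    "connector corrosion": 2,          # covered by a continuous monitor
    "harness insulation chafing": 10,  # no monitor at all
}

best_case = min(isolation_fault_causes.values())
average = mean(isolation_fault_causes.values())
worst_case = governing_m(isolation_fault_causes)
print(best_case, average, worst_case)
```

The best-monitor view gives 2 and the average gives 6, but the unmonitored path is still a 10. Only the worst-case figure tells the team that one route to the failure mode has no coverage.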

For teams transitioning from prevention-heavy thinking to monitoring-aware thinking, our explainer on prevention versus detection controls covers the mental model shift on the DFMEA side that has to happen before MSR can be applied cleanly.

What to Bring to Your First MSR Session

Practical preparation list, assuming a parent DFMEA is already in place:

System architecture diagram showing sensors, control units, actuators, and the communication paths between them
Diagnostic specification or DTC list, including detection thresholds and confirmation criteria
Fault-tolerant time interval (FTTI) for each failure mode that has one defined—typically from the functional safety concept or system requirements
System response definition: what each fault state does to the operator interface, the actuator commands, and the regulatory state
The parent DFMEA, with severity ratings already calibrated—Severity carries straight across, so the team does not need to re-debate it

If the parent DFMEA is incomplete or inconsistent, MSR will surface that quickly. Treat the MSR session as a forcing function for cleaning up the underlying DFMEA, not as a separate workstream to be completed in isolation. To validate AP combinations interactively while the team is rating, the RPN and Action Priority calculator covers both standard FMEA AP and MSR variants without forcing you to flip pages in the handbook during the session.

For the broader methodology context behind why AIAG and VDA added MSR as a supplemental layer rather than expanding DFMEA scope, the AIAG-VDA FMEA Handbook page covers the official rationale and the relationship to existing functional safety standards under ISO 26262-1:2018.