Edit history
Revision by serg002 (current version):
Is there a test that demonstrates memory errors on a real-world example?
This doesn't impress me:
Most available (cheap) desktop x86 platforms still have no ECC (Error Checking & Correction) memory support. But the rate of memory bit-flip errors is still growing (not the best SO thread; the large-scale CERN 2007 study «Data integrity»: «Bit Error Rate of 10^-12 for their memory modules … observed error rate is 4 orders of magnitude lower than expected»; Google's 2009 «DRAM Errors in the Wild: A Large-Scale Field Study»). For current hardware under a data-intensive load (8 GB/s of reading), this means that a single bit flip may occur every minute (vendors' BER of 10^-12 from CERN07) or once in two days (BER of 10^-16 from CERN07). Google09 says that there can be up to 25000-75000 one-bit FIT per Mbit (failures in time per billion hours), which is equal to 1-5 bit errors per hour for 8 GB of RAM («mean correctable error rates of 2000–6000 per GB per year»).
So, I want to know: is it possible to add some kind of system-wide software error detection (checking both user and kernel memory)? For example, could one patch the Linux kernel and/or the system compiler to checksum every memory page, and try to detect silent memory corruption (bit flips) by regularly recomputing the checksums?
For example, can we see all writes to memory (from both user and kernel space), to distinguish intended memory changes from in-memory bit flips? Or can we somehow instrument all code with some helper?
I understand that any kind of software memory ECC may cost a lot of performance and will not catch all errors, but I think it can be useful to detect at least some memory bit flips early, before they are reused in later computations or stored to the hard drive.
I also understand that the better way to protect data from memory bit flips is to switch to ECC hardware, but most PCs out there are still non-ECC.
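To make the page-checksumming idea from the quote concrete, here is a minimal user-space sketch in Python. It is not a kernel patch and does not answer the system-wide question. It assumes a buffer that the program itself does not write to between checks, a 4096-byte page size, and CRC32 as the checksum. The function names `page_checksums` and `find_corrupted_pages` are made up for illustration.

```python
import zlib

PAGE_SIZE = 4096  # assumed page size (typical for x86)

def page_checksums(buf):
    """Return one CRC32 per PAGE_SIZE-sized chunk of buf."""
    view = memoryview(buf)
    return [zlib.crc32(view[off:off + PAGE_SIZE])
            for off in range(0, len(view), PAGE_SIZE)]

def find_corrupted_pages(buf, expected):
    """Recompute the checksums and list the pages whose CRC32 changed."""
    current = page_checksums(buf)
    return [i for i, (old, new) in enumerate(zip(expected, current))
            if old != new]

# Demo: take a baseline, then simulate a single bit flip in page 2.
data = bytearray(4 * PAGE_SIZE)
baseline = page_checksums(data)
data[2 * PAGE_SIZE + 17] ^= 0x08  # flip one bit "behind the program's back"
print(find_corrupted_pages(data, baseline))  # [2]
```

The sketch only works because no legitimate writes happen between the baseline and the check. Telling intended writes apart from bit flips for all of user and kernel memory is exactly the hard part the quoted question asks about.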
So that means I get 1-5 errors per hour for 8 GB of RAM. Then why does everything work fine for me instead of getting covered in glitches?
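The figures in the quote can be rechecked with a few lines of arithmetic. This is a rough sketch, not the original author's calculation. It assumes that "8 GB/s" means 8·10^9 bytes per second, that "8GB of RAM" means 8 GiB, and that every bit read is an independent chance for an error at the stated BER.

```python
BITS_PER_BYTE = 8

def seconds_between_errors(ber, read_rate_bytes_per_s):
    """Mean time between bit errors for a given BER and read bandwidth."""
    bits_per_second = read_rate_bytes_per_s * BITS_PER_BYTE
    return 1.0 / (ber * bits_per_second)

def errors_per_hour_from_fit(fit_per_mbit, ram_bytes):
    """FIT is failures per 10**9 hours; scale from 1 Mbit to the whole RAM."""
    mbits = ram_bytes * BITS_PER_BYTE / 2**20
    return fit_per_mbit * mbits / 1e9

READ_RATE = 8e9      # 8 GB/s of reading (decimal gigabytes, an assumption)
RAM = 8 * 2**30      # 8 GiB of RAM (an assumption)

# BER 10^-12: about 16 seconds between errors
print(seconds_between_errors(1e-12, READ_RATE))
# BER 10^-16: about 1.8 days between errors
print(seconds_between_errors(1e-16, READ_RATE) / 86400)
# 25000 and 75000 FIT per Mbit: about 1.6 and 4.9 errors per hour
print(errors_per_hour_from_fit(25000, RAM))
print(errors_per_hour_from_fit(75000, RAM))
```

Under these assumptions the 10^-12 case gives an error roughly every 16 seconds rather than every minute (the same order of magnitude), while the "once in two days" and "1-5 errors per hour" figures match.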
Revision by serg002:
So that means I get 1-5 errors per hour per hour for 8 GB of RAM. Then why does everything work fine for me instead of getting covered in glitches?
Revision by serg002:
So that means I get qual 1-5 errors per hour per hour for 8 GB of RAM. Then why does everything work fine for me instead of getting covered in glitches?
Original version by serg002:
So that means I get qual 1-5 errors per hour per hour for 8 GB of RAM. Then why does everything work fine for me instead of getting covered in glitches?