I had the privilege yesterday of attending a session with an engineer from McAfee. Before he gave a presentation on McAfee’s Host Intrusion Prevention System and Whole Disk Encryption products, we had an hour-long Q&A session about the bad DAT (antivirus definitions) file that wreaked worm-like havoc on thousands of computers in my organization, and many, many more worldwide, when McAfee’s VirusScan Enterprise started detecting and quarantining a legit networking file in Windows XP SP3. The engineer was a very good sport to even face an angry mob of IT professionals armed with pitchforks and torches. To open the Q&A, he led off with a quick presentation explaining the cause of the problem, the lessons McAfee has painfully learned, and the steps they’re taking to prevent this from happening in the future. As you can imagine, it proved an interesting presentation and a lively discussion.
Most of McAfee’s write-ups on the incident are technical in nature, but from David DeWalt, the President and CEO of McAfee, and the visiting engineer we can learn a few more interesting details.
We learn that an engineer who has been with McAfee for about 15 years was working on a project for a mission-critical government defense customer. As part of his work, he modified the quality assurance environment to match the customer’s, but the standard QA environment was not reinstated afterward. This masked the problem with DAT 5958 on Windows XP SP3, and the definitions were released without anyone knowing that they produced a false positive on a system-critical file. The definitions were not specific enough to catch the w32/wecorl.a virus without VirusScan Enterprise also flagging svchost.exe as a false positive.
Lessons McAfee Learned
The lessons McAfee learned from this incident go hand-in-hand with the steps they’re taking to prevent the problem. It’s good to see that they’ve been stirred into action and they’re seeing their products in a new light and from a new perspective that allows them to make a better product.
One of the most prominent lessons learned was that they lacked a way to pull a DAT quickly if it was found to be bad. Since they mirror the DAT out, there was no central repository to unplug and stop the spread quickly.
Along those lines, they realized they lacked a means to communicate with customers rapidly. They plan to replace the currently static links on the ePO dashboard with more dynamic links that would be current and relevant to ongoing issues. They may also investigate other methods for improving smart messaging.
They also want to leverage Artemis, their heuristics-based and cloud-delivered antivirus module, to prevent this sort of thing from happening by whitelisting critical system files. Improvements to Artemis are to come in the 3rd quarter of 2010. Along with that, a more efficient engine is planned for 2011 that will have 1/3 the footprint the virus scan currently has.
McAfee acquired a company called Solidcore last year, and they want to make more use of its technology for better whitelisting in their products.
Steps to Prevent
McAfee will be increasing both its own audits and third-party audits for quality control of their products and of their company. They will also be instituting automated fail-safe change controls to prevent the exact situation that caused this incident from occurring.
Along with the change controls being put in place, they are going to have three redundant levels of QA before a DAT gets pushed out: their main lab, a secondary lab, and McAfee IT. They’re also setting up a “Customer Excellence Test Lab.” This allows you as a customer to create a standard image and then give it to McAfee where they’ll test their DATs and other updates against your standard installations. This should prevent those home-built applications from getting picked up as false positives along with the system critical files.
Currently the ePO has the ability to roll back DATs, but they want to expand upon that ability to make it easier to use and to make it work in less-than-ideal circumstances. They also want to eliminate DAT files from fixed-function devices. There are many things that run on a computer, but some of those devices have no way of having viruses or malware introduced to their systems. Why, then, do they need to always be running the latest DATs? McAfee sees the logic in this and hopes to change this functionality and make these devices more reliable.
I forget the details, if there were any, but McAfee will also be introducing a Quality Certification Standard in the second half of 2010.
If you are a home or home office user of McAfee and were affected by the bad DAT file on April 21st, McAfee is hoping to make it up to you:
For impacted home or home office customers who have incurred costs to repair PCs as a result of the security update issue, McAfee may reimburse reasonable expenses, such as a visit to a local tech support specialist. Additionally, because we value our loyal customers, home or home office users whose PCs were rendered inoperable or severely impaired as a result of the security update are being offered a free two-year extension of their current McAfee subscription product at no charge. Click here for more information on how to apply. OFFER EXPIRES ON MAY 31, 2010.
If you are a business user and were impacted by the bad DAT file, McAfee also has this for you:
McAfee is offering a complimentary one year subscription to our automated security Healthcheck Platform. This will include a 4-hour session of remote consulting in which we will help you set up and run the health check, review your policies, server configuration and environment, interpret the results and provide recommendations based on McAfee best practices. If you would like to take advantage of this offer, you must register at www.mcafeequickstart.com/register by June 15, 2010.
No word, though, for small businesses that were affected and had to incur the cost of bringing in an IT consultant for unexpected hours. They probably felt the impact pretty significantly, with the worst of both worlds: no on-staff IT and enough computers that business on that day was negatively affected.
McAfee also wants people to know about their Quick Tip video tutorials on using their products, the Support Notification Service emails you can subscribe to, and the McAfee Labs Security blog. Having more channels will allow McAfee to communicate quickly with its customers.
In quarter 3 of 2010, McAfee will be establishing a Customer Community Advisory Board to provide feedback directly to the company and in quarter 4 of 2010, McAfee will be holding its First Annual Cyber Crisis drill to help get prepared for a technology-based disaster. This would not be practice for other bad DAT files down the road, but instead something similar to the war game, Cyber Shockwave, we painfully saw played out on CNN.
Overall, nobody wants to over-react. While this had a terrible effect on organizations from a few hours to a few days, it has only happened this one time in the last… ~4 years. Is all this hub-bub worth it? McAfee will hopefully be quite improved from all its lessons learned. So what are the official recommendations McAfee has to avoid being hit by a bad DAT file?
#1. Delay DAT release and let somebody else suffer
The fact that the bad DAT on April 21st happened fairly early in the morning probably saved McAfee a lot of grief, as it prevented many of the corporations and organizations on the West Coast of the USA from joining in on the suffering. If your organization delayed pushing out the DAT until a while after, you may not have even known there was a problem. But what happens if everybody delays?
#2. Set up a testing method in our environment
Use the evaluation branch in ePO and set a few machines (a cross-section sample of the computers you manage) to be beta testers for each DAT release. If after a while no machines are negatively impacted, manually move the DAT over to the Current branch.
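One way to pick that cross-section is to group machines by the traits most likely to matter (operating system and role, for example) and take a couple from each group. The sketch below is only an illustration under assumptions: the inventory records, their field names, and the pick_pilot_machines helper are all hypothetical, and nothing here talks to ePO.

```python
import random
from collections import defaultdict


def pick_pilot_machines(inventory, per_group=2, seed=None):
    """Pick up to `per_group` machine names from every (os, role) group.

    `inventory` is a list of dicts with hypothetical keys
    "name", "os", and "role"; adapt these to whatever your asset
    list actually records.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for machine in inventory:
        groups[(machine["os"], machine["role"])].append(machine["name"])

    pilot = []
    for key in sorted(groups):
        names = sorted(groups[key])
        # Never ask for more machines than the group contains.
        pilot.extend(rng.sample(names, min(per_group, len(names))))
    return pilot


# Made-up inventory for illustration only.
inventory = [
    {"name": "pc-001", "os": "XP SP3", "role": "desktop"},
    {"name": "pc-002", "os": "XP SP3", "role": "desktop"},
    {"name": "pc-003", "os": "XP SP3", "role": "desktop"},
    {"name": "lt-001", "os": "XP SP3", "role": "laptop"},
    {"name": "pc-101", "os": "Windows 7", "role": "desktop"},
    {"name": "srv-01", "os": "Server 2003", "role": "server"},
]

pilot = pick_pilot_machines(inventory, per_group=2, seed=1)
print(pilot)
```

Sampling per group rather than across the whole fleet keeps a rare configuration, such as the lone server here, from being left out of the test set.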
The second solution is the one currently opted for, though it requires additional staff time to manually switch over the DAT. McAfee also does not release DAT files at a standard time, which makes the manual task that much harder to rely on. The release typically comes around 10 AM PST, but with a substantial and unknown deviation from that.
You may also want to rethink using the Global Update feature in ePO, which wakes up agents and pushes out updates once the ePO server has them. This speeds up the distribution process: a good thing when the DATs are good and a bad thing when the DATs are bad.
#3. Add a Feature
In reaction to DAT 5958, people are putting in a Feature Modification Request for a feature that would handle this situation better. The current request is for a feature that would automatically download the latest DAT into the Evaluation branch, wait for a specified timer to go off (say, 4 hours), and then move the DAT into the Current branch.
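The logic of the requested feature is simple enough to sketch. What follows is not an ePO API and not how McAfee would implement it; the StagedDat class, the promote_if_ready function, and the pilot_problems count are hypothetical names used only to show the hold-then-promote rule.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class StagedDat:
    """A DAT sitting in a repository branch (hypothetical model)."""
    version: int
    staged_at: datetime
    branch: str = "Evaluation"


def promote_if_ready(dat, now, hold=timedelta(hours=4), pilot_problems=0):
    """Move `dat` to the Current branch once the hold timer has expired.

    Promotion is refused if the DAT is not in Evaluation, if any pilot
    machine reported a problem, or if the hold period is not over yet.
    Returns True only when the DAT was promoted by this call.
    """
    if dat.branch != "Evaluation":
        return False
    if pilot_problems > 0:
        return False
    if now - dat.staged_at < hold:
        return False
    dat.branch = "Current"
    return True


# Illustrative timestamps only.
staged = datetime(2010, 4, 21, 6, 0)
dat = StagedDat(version=5958, staged_at=staged)

too_early = promote_if_ready(dat, now=staged + timedelta(hours=1))
blocked = promote_if_ready(dat, now=staged + timedelta(hours=5), pilot_problems=3)
print(too_early, blocked, dat.branch)
```

The extra pilot_problems check goes slightly beyond the request as described, which only mentions a timer; it is included on the assumption that a timer alone is of little use if nobody can stop the promotion when the test machines start failing.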
Another feature that would be nice would be related to statistics gathering and the first recommended solution. DATs are released almost daily and we try to rush them out the door to provide ourselves the maximum protection. It would be very interesting (and helpful for planning) if stats could show how many detections there are across an organization that stem from the latest DAT.
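As a rough idea of what such a statistic could look like, the sketch below counts detections by the DAT version that first made each detection possible. The event records and the first_dat field are assumptions made for illustration; whether ePO exposes that information in this form is not something covered in the session.

```python
from collections import Counter


def detections_by_dat(events):
    """Count detection events per DAT version that first covered them.

    Each event is a dict with a hypothetical "first_dat" key: the
    earliest DAT version able to detect that threat.
    """
    return dict(Counter(event["first_dat"] for event in events))


def share_from_latest(events, latest_dat):
    """Fraction of detections that only the latest DAT could have made."""
    if not events:
        return 0.0
    hits = sum(1 for event in events if event["first_dat"] == latest_dat)
    return hits / len(events)


# Made-up detection log for illustration only.
events = [
    {"host": "pc-001", "threat": "threat-a", "first_dat": 5950},
    {"host": "pc-002", "threat": "threat-a", "first_dat": 5950},
    {"host": "pc-003", "threat": "threat-b", "first_dat": 5957},
    {"host": "pc-004", "threat": "threat-c", "first_dat": 5958},
]

print(detections_by_dat(events))
print(share_from_latest(events, latest_dat=5958))
```

If the share attributable to the newest DAT stays small week after week, that is an argument that a few hours of delay costs little protection; if it is large, rushing the DAT out is earning its keep.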
We made it through alive and it looks like McAfee did as well. With a lot of improvements planned in the future, hopefully McAfee will become the better company we’re all wanting.