
Applying TFS 2010 SP1 may, in rare instances, cause an error that prevents the data warehouse (and, by extension, the analysis cube) from updating work item information beyond the date and time at which SP1 was installed. This means all work item reports will freeze at the date on which TFS 2010 SP1 was applied. In addition, the Work Item Tracking Warehouse Sync job will hang and never complete.



Symptoms

After a TFS 2010 SP1 install, work items are no longer processed in the data warehouse, although all of the other elements (such as version control, test cases, etc.) are still processed. In addition, the Work Item Tracking Warehouse Sync job hangs and never finishes. If you view the data warehouse or the analysis cube, you will find that no new work item data has been recorded since the install of TFS 2010 SP1.

In addition, you will receive an error in the event log similar to this: TF221122: An error occurred running job Work Item Tracking Warehouse Sync for team project collection or Team Foundation server DefaultCollection

Bottom Line Up Front

There was an error in TFS 2010 SP1 that set a watermark improperly. Watermarks record how many work items have been processed into the warehouse and where the next iteration should start. Verifying the problem requires running a SQL query against the transactional database. Fixing the problem (also described at the bottom of the post) requires executing an update directly against the SQL transactional database.
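To make the watermark idea concrete, here is a minimal Python sketch of watermark-based incremental processing. It is a toy model, not how TFS implements its warehouse sync: the `Change` type, the `change_order` field, and the `sync_batch` function are all hypothetical names invented for illustration. It shows only one way a mis-set watermark can freeze the data (new changes look already processed); it does not reproduce the hang seen in the real job.

```python
from dataclasses import dataclass


@dataclass
class Change:
    work_item_id: int
    change_order: int  # hypothetical, monotonically increasing sequence number


def sync_batch(changes, watermark):
    """Return the changes newer than the watermark, plus the advanced watermark."""
    pending = sorted(
        (c for c in changes if c.change_order > watermark),
        key=lambda c: c.change_order,
    )
    # Advance the watermark to the last change processed, or keep it if idle.
    new_watermark = pending[-1].change_order if pending else watermark
    return pending, new_watermark


changes = [Change(1, 10), Change(2, 11), Change(3, 12)]

# Healthy run: the watermark sits at 10, so changes 11 and 12 are picked up.
processed, watermark = sync_batch(changes, watermark=10)
healthy_ids = [c.work_item_id for c in processed]

# Broken run (assumed failure mode): the watermark was set far ahead of the
# real data, so a genuinely new change is treated as already processed.
changes.append(Change(4, 13))
skipped, stuck_watermark = sync_batch(changes, watermark=1000)
```

In the healthy run the watermark advances with the data. In the broken run nothing is ever selected and the watermark never moves, which is the same visible symptom as reports frozen at a fixed date.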

WARNING: No direct changes to the database are supported by Microsoft. In order to be supported, you should contact Microsoft PSS, and explain the problem, point them to this blog post, let them do their due diligence first, and have THEM recommend running the script!! I am hesitant to even list the fix, and you run it at your OWN RISK!

Table of Contents

Because this is a relatively large post, I want to provide a quick table of contents to the remainder of the post to help clarify the structure.

  • Troubleshooting Steps
    • Show that problem occurs on any work item changed AFTER SP1 was installed
    • Highlight two warnings that occurred during SP1 install (Ended up not being related to problem.)
    • Common, but strange warning in the Event Logs (Ended up not being related to problem.)
    • Event Logs show both Errors and Warnings that are directly related to problem
    • Similar Error found on Bing shows corrupted work item from upgrade caused problems.
    • Other errors are caused by permissions or unsolved bugs in TFS, but don’t apply here since they block ALL processing, not just work items
    • Rebooting doesn’t help.
    • Rebuilding the data warehouse / OLAP cube doesn’t help (and it gets stuck in the exact same place each time)
  • Problem Verification
    • SQL Script that identifies that there IS a problem
  • Problem Solution (WARNING: Contact MS PSS before running, since it requires making direct changes against the TFS database!)
    • SQL Script that resolves the problem

Troubleshooting Steps

The troubleshooting steps are included to walk through much of the thought process I ran through to identify the problem, and to help folks who may be running into a SIMILAR problem, but one that is not exactly the same. Feel free to go to the bottom of the post for the verification and fix steps.

First, prove that this problem occurred immediately upon install of SP1.

Here’s the strange thing: I can process all data, including all changes to the Work Item store, up until the time that SP1 was installed, but nothing after that point. The cube data shows two columns. The first is the number of work items, which freezes as of Sept 22 (the date of the SP1 install), even though the DATE dimension in the cube continues beyond that date. That’s because everything ELSE is processing, just not the work items. The second column is the work item revision count, which also stops as of Sept 22.

Figure: Work item count and work item revision count by date, both frozen as of Sept 22

 

Second, I examined the logs for the TFS 2010 SP1 install and CU1 install. Everything was clear with the exception of one [WARNING] in each of the installs:

Warning from SP1 install:

Figure: Surrounding information in upgrade log

Warning from the CU1 install (very similar to above):

Figure: Surrounding information in upgrade log


Third, here are some of the Application Event Log entries that may indicate an SP1 install failure. (This ended up NOT being the case.)

The first is a common entry that appears each time TfsAgent logs an information, warning, or error message. You can ignore most of the content of this entry, since it was logged when I’d stopped the services anyway. The only critical thing is the highlighted part, which led me to believe there was something wrong with the SP1 install. (Google / Bing were not helpful in this scenario, even though there were plenty of hits.)

 

Figure: Full Text of Warning

Fourth, an Error is logged when the Work Item Tracking Warehouse Sync job runs. This is the one that matters, since it represents the actual failure. When it fails, it does NOT stop the TFS Job; the job simply hangs forever. Interestingly, the warehouse IS updated up to the day SP1 was installed, so the cube can get the data up to that point. Notice that the highlighted areas show a SQL transport-level problem: it appears as though a network name is no longer available. (Quick reminder that ALL the other Warehouse jobs process successfully, as do the Cube jobs.) Again, recall that the Work Item job can process up to the date that TFS 2010 SP1 was installed, so it is not failing immediately, only after processing all the work item events up to that point. We have repeatedly rebuilt the warehouse and cube from scratch, and each time it processes up to the date of the SP1 install.

 

Figure: Entire Error message pertaining directly to the problem

Next is a related WARNING in the Application Event Log. It occurs just prior to the ERROR noted above. Notice that it states that a severe error has occurred and the results should not be trusted. It also mentions that the action was cancelled by the user. No HUMAN cancelled it, but the TFS Job agent may have.

 

Figure: Full Text of Warning

Fifth, a Google / Bing search reveals only ONE exactly related hit for TF221122: “An error occurred running job Work Item Tracking Warehouse” “A severe error occurred on the current command”

http://www.go4answers.com/Example/tfs-2010-warehouse-update-severe-148743.aspx – Exactly the same problem, and points to a ROGUE work item that was corrupted during the upgrade from TFS 2008 to TFS 2010. We were good till the update to SP1, so maybe we got corrupted work items there…

Using the SQL at http://blog.mohammadjalloul.com/blogs/mo/archive/2011/08/29/how-to-detect-corrupted-tfs-work-items.aspx I searched for any corrupted work items. No luck. Query returned no corrupted work items.

So I queried for every work item that was changed on Sept 22 (the date of the last work item update to the warehouse) and manually reviewed each one. Every one of them could be resaved to TFS without error, and I found no other indication that anything was wrong.

At this point in the exploration, I did not believe that work item corruption was the problem, or at least the corruption was very hard to spot.

Sixth, Steven St Jean ran into a similar problem here: http://sstjean.blogspot.com/2010/07/tfs-2010-rtm-throws-tf221122-error-when.html

My situation is different, since Analysis Services is already running under a domain account. I could change to NETWORK SERVICE to see if that would help, but the problem itself is still different. In his case, Analysis Services was running under the Network Service account and failing REPEATABLY, and changing to a domain account fixed it. Also, in his case the ENTIRE processing of the Warehouse was failing. For me, it’s ONLY the work items.

Seventh, reboot multiple times. Declared this weekend to be open maintenance, and rebooted multiple times after trying various things. No change to the behavior.

Eighth, delete and rebuild Tfs_Warehouse and Tfs_Analysis multiple times. In each case it will process the work items up to the date and time the SP1 was installed. (I can show that the last work item change prior to SP1 makes it into the warehouse/cube, but that the very next work item change, done after the SP1 was installed, does NOT make it into the warehouse/cube.)

Ninth, work directly with some of the people at Microsoft DevDiv who built the product. (Ah, the advantages of being an MVP.) This step led to the solution.

Problem Verification

Once we understood the problem more deeply, this script was run to prove that there was an issue between the two particular tables. It verifies that there is an incorrect change order between the WorkItemsLatest and WorkItemsAre tables. If this SQL query returns any rows, then you have the problem. If no rows are returned, look elsewhere, because the fix script won’t help you.
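To illustrate the shape of such a check without touching a real TFS database, here is a toy sketch in Python against an in-memory SQLite database. It is NOT the verification script and must not be run against TFS: the two table names come from the description above, but the simplified schema, the `ChangedOrder` column name, the sample data, and the assumption that "incorrect" means the two tables disagree for the same work item ID are all mine, for illustration only.

```python
import sqlite3

# In-memory toy database; nothing here touches a real TFS collection.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical, heavily simplified stand-ins for the two tables.
    CREATE TABLE WorkItemsLatest (ID INTEGER PRIMARY KEY, ChangedOrder INTEGER);
    CREATE TABLE WorkItemsAre    (ID INTEGER PRIMARY KEY, ChangedOrder INTEGER);
    INSERT INTO WorkItemsLatest VALUES (1, 100), (2, 101), (3, 999);
    INSERT INTO WorkItemsAre    VALUES (1, 100), (2, 101), (3, 102);
""")

# Any row returned is a work item whose change order differs between tables.
mismatches = conn.execute("""
    SELECT l.ID, l.ChangedOrder, a.ChangedOrder
    FROM WorkItemsLatest AS l
    JOIN WorkItemsAre AS a ON a.ID = l.ID
    WHERE l.ChangedOrder <> a.ChangedOrder
    ORDER BY l.ID
""").fetchall()

print(mismatches)
conn.close()
```

With the sample data, only work item 3 is reported, because it is the only row where the two tables disagree. An empty result would correspond to the "look elsewhere" case.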

Problem Solution

Below is the SQL Script that solves the problem by setting the values of the Changed Order columns in the WorkItemsLatest table to the values currently in the WorkItemsAre table.

WARNING: RUN THIS AT YOUR OWN RISK!  I HIGHLY recommend you contact Microsoft PSS, and work with them through this issue.  This SQL script is provided “for entertainment purposes only”…

Conclusion

I’ve done a large number of TFS 2010 SP1 migrations, and this is the first time I’ve run into this problem.  So, don’t let it freak you out.  However, if you run into this problem, get PSS on the phone right away, and get this problem solved!  (As a side note, I don’t believe you’ll be charged for the call, since it does appear to be a bug in the SP1 upgrade process.)

  • It is really important to note the text in red at the top of @stevenborg’s post. Do not try this at home…