We have been working on an interesting skunkworks-style project and are pleased to share it with all of you. It began when we noticed that Microsoft was archiving or deleting Support KB articles from its site, often even when the information was still pertinent. A number of the items we track in the ADCS Hotfix Digest were soft-failing when hitting the Microsoft site: the pages were gone, but no 404 error was returned. As a result, our own tools weren’t even notifying us about the deletions.
When we reached out to the Microsoft product team for ADCS, they weren’t even aware the articles were being archived, as it wasn’t happening at their behest. After several weeks it became clear that the articles weren’t coming back and that the problem would continue to affect us all.
This led us to create a Microsoft KB Support site archive; think of it as a targeted Wayback Machine. We are calling it the Microsoft KB Archive, and it is now available at https://mskb.pkisolutions.com. It will also be accessible via our site navigation under PKI Blog & Resources/MS KB Archive Search.
Article statistics as of January 29, 2020
| Total articles from Microsoft | 49,430 |
| Active articles on Microsoft.com | 48,646 |
| Archived articles in our MSKB | 784 |
The idea behind this is to restore access to valuable information that was once stored in the Microsoft Knowledge Base. Microsoft has been performing a massive, ongoing cleanup of its websites, including Downloads, TechNet and MSDN Blogs, and the Knowledge Base. Microsoft deleted a large number of technical whitepapers from its websites, and I even had to publish an archive collection of the PKI-related whitepapers.
In 2018, Microsoft started retiring TechNet and MSDN blogs without any announcement. In March 2019, I published a solution for saving a local copy of blogs that had not yet been deleted. After a large number of complaints on social networks, the Microsoft blogs were restored as a read-only archive. It took Microsoft more than a year to react and restore access to the blogs.
Similarly, Microsoft is actively removing content from the Knowledge Base. We understand that Microsoft removes KB articles for retired products such as Windows Server 2003 since they are no longer supported. The process seems to make sense, except that KB articles contain information that relates not only to retired products but often applies to newer versions as well. This includes technology descriptions, tips & tricks, workarounds for issues and limitations, design decisions, etc. This is especially true of articles written for Windows Server 2003 and Windows XP.
Unfortunately, Microsoft articles aren’t reviewed and updated to reflect newer applicable OSes. An example from the list of items in the ADCS Hotfix Digest: The Issuer Statement Specified in the Capolicy.inf File Is Not Included in the Issued Certificate (KB283789). The article applies to all OS versions from Windows 2000 Server to Windows Server 2019 (the most recent at the time of writing). If you try to open it on the Microsoft site, you get a 404 error page. The only alternative for getting the contents is to use a web archive. But the Wayback Machine is a last resort and can be very tricky: sometimes you get error pages (from Microsoft), and sometimes the page is in the wrong language or has a broken layout. It takes some effort to get the right page.
Path To Solution
In November 2019, I started researching possible APIs for getting the list of current articles (those that exist or existed on the Microsoft Support website), the list of products, and anything else I could use to retrieve data. I found an API that can retrieve articles by exact number, but no APIs that return a plain list of available articles. I engaged several people (Vadim Sterkin, Vasily Gusev, Alexander Sukhovey and others) to do an exhaustive search to build the list. In total, we made over 5 million requests using PowerShell and retrieved about 10GB of data (including HTML and attachments). Here are some numbers:
In total, we got 105 thousand files. I did a deep analysis of the downloaded data, then grouped, filtered, and shaped it and packed it into a SQL database. The current database size (as of January 2020) is about 5.6GB, plus 4GB of article attachments.
For the front end, we chose a simple ASP.NET Core website project using the standard Bootstrap 3.3.7 template.
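The exhaustive search described above can be sketched roughly as follows. This is a minimal illustration in Python, not the PowerShell used for the real crawl. The `fetch` callable, the ID range, and the function names are hypothetical stand-ins, and the actual API endpoint is deliberately not shown. The idea is simply to probe every candidate article number and keep the ones that resolve, in batches that could be handed out to several helpers.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Optional


def batched(ids: Iterable[int], size: int) -> Iterator[List[int]]:
    """Split a stream of candidate article IDs into fixed-size batches."""
    batch: List[int] = []
    for article_id in ids:
        batch.append(article_id)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        # Emit the final, possibly shorter, batch.
        yield batch


def enumerate_articles(
    first_id: int,
    last_id: int,
    fetch: Callable[[int], Optional[str]],
    batch_size: int = 1000,
) -> Dict[int, str]:
    """Probe every ID in [first_id, last_id] and keep the ones that resolve.

    `fetch` is a hypothetical lookup-by-exact-number call: it returns the
    article body, or None when no article exists under that number.
    """
    found: Dict[int, str] = {}
    for batch in batched(range(first_id, last_id + 1), batch_size):
        for article_id in batch:
            body = fetch(article_id)
            if body is not None:
                found[article_id] = body
    return found


if __name__ == "__main__":
    # Fake lookup standing in for the real API: only two IDs "exist".
    fake_store = {283789: "<html>KB283789</html>", 283800: "<html>KB283800</html>"}
    result = enumerate_articles(283780, 283810, fake_store.get, batch_size=10)
    print(sorted(result))
```

Passing the lookup in as a parameter keeps the sketch testable without network access; a real crawl would also need throttling and retries, which are left out here.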
Our intent and goal is only to provide an archive of content deleted from the Microsoft KB site. To that end, we will be tracking newly created articles on a regular basis. Once the scanning engine detects that an article has been removed from the Microsoft site, it will become available to view in our tool. We will NOT be displaying any information that is currently available on the Microsoft website. We have no desire to compete with the Microsoft site for information sharing or traffic.
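As a rough illustration of what detecting a removed article could look like in code, here is a minimal Python sketch. It is not the production scanning engine: the marker phrases, the `ArticleState` names, and the function itself are assumptions made up for this example. It covers both the hard failure (an HTTP 404 or 410) and the soft failure mentioned earlier, where the site answers with HTTP 200 but the body is really an error page.

```python
from enum import Enum


class ArticleState(Enum):
    LIVE = "live"
    REMOVED = "removed"
    UNKNOWN = "unknown"  # e.g. a server error; retry later instead of deciding


# Hypothetical phrases that a "soft" error page might contain.
SOFT_FAIL_MARKERS = ("page not found", "content is no longer available")


def classify_response(status_code: int, body: str) -> ArticleState:
    """Decide whether a fetched KB page still exists on the source site."""
    if status_code in (404, 410):
        # Hard failure: the server says outright that the page is gone.
        return ArticleState.REMOVED
    if status_code == 200:
        text = body.lower()
        if any(marker in text for marker in SOFT_FAIL_MARKERS):
            # Soft failure: a success status wrapped around an error page.
            return ArticleState.REMOVED
        return ArticleState.LIVE
    # Anything else (timeouts, 5xx, redirects) is not enough to decide.
    return ArticleState.UNKNOWN
```

Treating unexpected status codes as `UNKNOWN` rather than `REMOVED` avoids publishing an article just because the source site had a temporary outage.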
- When you enter the website, you are redirected to the search page, where you can search for articles by ID number or by words in the article title or description.
- You can browse all product names, and the list of fetched articles for a given product, by visiting the KB Products page. Currently, we show all titles but provide access to offline articles only. We will improve this behavior in the future.
- List all offline articles. At the bottom of the site, click the Offline Articles link to get the list of all publicly available articles, with a link to each one.
- The database is automatically updated each month. The automatic update system checks existing articles for updates, adds new articles to the database, and checks for retired articles. Once a KB article is retired by Microsoft, it is made publicly accessible in our archive.
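The monthly update described in the last item boils down to comparing what the database knew at the previous scan with what the latest scan found. Below is a minimal Python sketch of that comparison. The data shapes (article ID mapped to a content hash) and all names are hypothetical and are not the actual database schema.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class SyncPlan:
    new: Set[int] = field(default_factory=set)      # seen upstream for the first time
    updated: Set[int] = field(default_factory=set)  # content hash changed since last scan
    retired: Set[int] = field(default_factory=set)  # gone upstream; publish from archive


def plan_sync(stored: Dict[int, str], upstream: Dict[int, str]) -> SyncPlan:
    """Compare the previous scan with the latest one.

    `stored` maps the IDs of articles that were live at the last scan to
    their content hashes; `upstream` holds the same for the latest scan.
    """
    plan = SyncPlan()
    for article_id, digest in upstream.items():
        if article_id not in stored:
            plan.new.add(article_id)
        elif stored[article_id] != digest:
            plan.updated.add(article_id)
    # Anything we knew about that the latest scan no longer sees is retired.
    plan.retired = set(stored) - set(upstream)
    return plan


if __name__ == "__main__":
    previous = {1: "a", 2: "b", 3: "c"}
    latest = {1: "a", 2: "x", 4: "d"}
    print(plan_sync(previous, latest))
```

Keeping the comparison separate from fetching and storage means the same plan can be logged or reviewed before anything is made public.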
This is our initial release. We are planning the following enhancements:
- Add more offline articles from the Wayback Machine. Every 2-3 years, Microsoft changes its KB website layout and markup, which makes it more difficult to reliably parse the source HTML for every version. Fortunately, I’ve managed to write parsers for versions V1-V4 (from around 2007 to 2015), which contain most of the retired articles. Thus, I can automate importing articles from the web archive into public access.
- Provide email subscription functionality. In the future, we will provide the ability to subscribe to updates to our database. You will be able to choose the products you are interested in or get updates for all products.
We are open to your feedback. If you find that something doesn’t work as intended, or you have suggestions or enhancement requests, please contact us via the contact form: https://www.pkisolutions.com/contact/