Archiving the Internet the German Way…

This past week the German Bundestag (parliament) published a law (passed in 2006, but not in effect until published in its final form) mandating that all German websites deliver a copy of all of their digital content (text, photos, sound, and any other multimedia content) to the National Library in Leipzig, the German equivalent of the U.S. Library of Congress. German companies protested the law throughout the legislative process, arguing that it placed an undue burden on them and would result in enormous financial costs. Not only is the law itself interesting – mandating that the state archive the internet – but so is how that internet content is going to be preserved. The library has asked that all website content be submitted in one of two formats: either as a PDF file or, if the content stretches over multiple pages, such as a multi-page HTML website, as a ZIP archive containing all of the related files. This last bit alone raises so many questions – which files need to be included, and how often do companies need to resubmit their content? Every time the website is updated?
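The law itself apparently says nothing about tooling, only about the two delivery formats. For the ZIP route, packaging a multi-page static site really is as mechanical as it sounds – a minimal sketch (the function name and directory layout here are my own illustration, not anything the library prescribes):

```python
import zipfile
from pathlib import Path

def zip_site(site_dir: str, out_path: str) -> None:
    """Package every file under site_dir into a single ZIP archive,
    preserving relative paths, so internal links keep working."""
    root = Path(site_dir)
    with zipfile.ZipFile(out_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for f in sorted(root.rglob("*")):
            if f.is_file():
                # Store paths relative to the site root, e.g. "img/logo.png"
                zf.write(f, f.relative_to(root).as_posix())
```

Of course, this only captures the files as served to a browser at one moment in time; it answers none of the harder questions about dynamic content or resubmission frequency.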

One exception to the law is content that is generated by private citizens for private use. But this raises a whole other set of questions, such as what is “private” on the Internet, a space that is by design “public”? Pundits have been quick to point to the gray area of Weblogs – are they private or are they public? If companies are supposed to archive all of their content, then what happens to “private” blogs that are hosted by for-profit companies like Blogger or even Facebook? Theoretically, companies that don’t comply will be served a letter of warning followed by a fine of up to 10,000 Euros for each act of non-compliance. At the moment, the National Library has issued a statement that it is not going to enforce the statute until it has been able to fully assess its ability to store all of the data that will be flowing its way.

In a related story out of Europe this week – the European Union has decided to take on the Google Book Project by digitizing the contents of Europe’s largest libraries, museums, archives, and film studios and placing this content online. The first incarnation of “Europeana” should launch on November 20th. The European Commission, which is coordinating, but not actually carrying out, the project, hopes that the new website will become a clearinghouse for access to European civilization. Those are lofty goals indeed, and competing with Google might be an even loftier one, but it will be a wonderful (free) resource for those of us living and teaching outside of Europe. At the same time, as per the EU’s goals, the digitization of the cultural objects will also serve to preserve them in a digital age. For more on this topic, see this English-language article from Der Spiegel.

Both of these recent examples highlight several of the themes that were addressed in this week’s readings. As all three of the readings suggest, finding a digital medium that can hold its own over time is probably going to be the greatest challenge for digital archivists. I see a great hurdle being created here by the German National Library – submitting one’s site via PDF or a ZIP file is probably only a temporary fix and does not actually ensure any sort of preservation. With the PDF format, the library is basically asking the owners of the sites to “print” out their site and submit this copy to the library in what is at the moment the ubiquitous e-book or e-paper format. However, will it remain so in the near and distant future? Already there are competing formats, most of which actually rely on less sophisticated coding – using ASCII text and a style sheet instead of embedded formatting. The ZIP file format seems even more problematic – who is going to guarantee that what is submitted can actually be accessed? So much of the rich multimedia content on the web is dependent on specific server-side technologies that would make a stand-alone version relatively useless. Instead, maybe the National Library should consult with the people at the Internet Archive about better ways to archive the content that the library desires…

The Europeana project sounds more feasible, as it aims not to digitize everything out there, but to gather together the various digitization projects in Europe and place them under one roof for easy access and cross-referencing. Europeana is in effect attempting to build a multimedia encyclopedia out of the content that has been or is being created in Europe. One of the criticisms raised in the Rosenzweig article was that archivists complain that they cannot archive everything and that someone (i.e., historians) needs to help determine what should be preserved and what can be discarded. In some ways, Europeana is performing exactly this function (at least partially) – it is selecting those aspects of European culture that have been deemed the most important for inclusion, which in turn will guide other archivists and curators to gather more content in order to further enhance the collection as it grows over time.

The immense job of a digital archivist is far from enviable, especially when important digital historical documents have been willfully or even purposefully deleted. One of my favorite Bloggers is Dan Froomkin of the Washington Post. He writes a Blog called White House Watch, which analyzes not just the White House but also the White House Press Corps. Starting in April 2007, Froomkin wrote a series of posts concerned with the deletion of White House emails and how they could impact future historical accounts of the Bush White House. The first article in the series is here and is worth a read.

4 Replies to “Archiving the Internet the German Way…”

  1. John,

    The examples that you cite really do a great job of tying theoretical issues to real life. I definitely admire Germany for making some sort of effort to preserve their digital history, but I agree with you that their methods are far from perfect. In a certain sense, I think that, like Rosenzweig said in “Scarcity or Abundance,” in order to reach any of our preservation goals at all, we need to remember that imperfection is okay when it comes to digital preservation. It seems like so many would-be digital archivists and historians get stuck on that part.

    We should concentrate on what preserving digital records will have in common with preserving print material to keep our goals achievable. I think this is where the people in Germany go off track. Just because there is no way to preserve the entire Internet is no reason to go with a stupid option like archiving things in a proprietary medium. I think that the Internet Archive is much closer to having a solid preservation model, though in their case they don’t have the institutional stability that the German national archive has.

    I think the key here is remembering the past. We tend to think of the Internet as a very dynamic medium, and it is, but not necessarily more so than any other type of information medium. Correspondence involves multiple people, some of whom may be missing in a given archival collection, and newspaper articles are written in a certain climate and culture that, for the most part, cannot be retrieved. I think that we need to acknowledge that the same is true for digital material before we can go about setting up good preservation goals that are limited enough to be realistic.

    That’s the overly practical archivist side of me talking though!

  2. Hi Jon,

    This policy certainly does raise more questions than it answers! In particular, the question of whether companies that host private blogs will be required to comply with the law is one that really stands out to me. While it is obviously true that blogs are by nature a public medium, if I was told that the government would be holding a copy of my blog in their archives for eternity I would take it down in an instant! I suspect that most individuals create blogs knowing that random people will find them and read them, but also thinking that we retain control over the ability to erase them at any time (or perhaps not really thinking about it at all).

    The Europeana collection is interesting, as you mentioned, because it performs the “weeding out” function that archivists have been seeking in digital preservation. However, when considering these two projects together (archiving all German websites and creating a European cultural archive) the one thing that strikes me is that, ostensibly, neither of these projects addresses how to preserve some of the more mundane uses of the internet (email, blogging, etc.) by the ordinary person. If we’re going to have a complete archival record, we can’t focus solely on the profound and prominent.

  3. ‘Morning!

    Interesting. I wonder if the German law includes any provision for “culling out” duplicate or redundant files. If I update my company page, but make no substantive changes (maybe fixing the grammar here or there), does that make it different enough to count as “evidence of a change over time” and worth keeping the two different copies?

    We seem also to have opened up an “expectation of privacy” issue as well. Most physical archive holdings (and Kate can correct me if I’m mistaken) have gone through at least one or two vetting processes by the time they become available to the public – either by the donor or their agents, and again by the accessioning archivist, to ensure that ethical considerations concerning personal privacy are taken into account. But, as both you and Laura have pointed out, if a private company must (under threat of penalty) submit copies of the content of their customers, we’ve lost that control. Granted, people don’t seem to consider that (or we wouldn’t have endless stories about old Facebook photos causing problems during post-collegiate job searches).

    .zip files? I might have an old zipdrive around somewhere, if they’d like to take it off my hands…

  4. Hi Jon,

    My biggest question here would be just how different a large government-run archive like the German competitor to Google Books would be from our corporate-run, business-oriented companies. It is a relief to know that this has been sanctioned by the European Union, and I am wondering just how influential this will be in other countries, particularly the United States. Knowing that there are countries willing to open up their digital borders gives us a larger landscape, and also alters the way historians search for their archival content. The government also becomes more aware of the enormity of its collections and what gets placed online. In this sense, historians will have to assess the amount of access given to them by these political powers, and what content is missing and why it is missing.

    On top of all of this, issues of standardization come into play again, even across international borders. Is the EU using the same equipment and technology in order to preserve its materials? It’s great that they’re using the web to store web-based items. Can the internet be used as a medium for full access? Is there such a thing as a more advanced internet that cannot be translated at some future point in time?

    (Also, in response to Bill’s comment, I haven’t used a .zip drive in years, but I do get .zip files pretty commonly. It just seems like the computer has been able to move on without this particular piece of hardware, but finds the software still useful.)
