Sunday, September 11, 2011

Semantic Mark-up and implied context

So today's entry is not at all Python-oriented (except in that one of my longer-term projects is a Content Management System [CMS] that I'll get to writing in Python). It occurred to me while driving around today doing various errands that there's a potential fundamental disconnect in how the CMS at work is being used, at least with respect to how it's intended to be used. In a nutshell, the CMS in question (SDL Tridion) is a system for the creation and management of content, specifically for content being published to any number of web-pages. That might feel like a "duh" moment, but there are some subtle distinctions between that sort of approach and other systems whose goal is to manage pages with content in them, rather than content that is associated with pages.

Let me explain.

Tridion's approach is to treat content as structured XML, with formal associated schemas. With the exception of "rich text" fields, which can themselves contain HTML markup, fields are simple values, with no tags or code around them other than the markup of the XML structure that defines them. Those content-items ("components" in Tridion-ese :-) ) are processed by Component Templates (CTs), each of which can have any number of processing steps (Template Building Blocks, or TBBs), into a Component Presentation (CP) containing the full, presumably client-ready HTML markup. Pages, then, are constructed by assembling lists of CPs; a Page Template (PT) and its associated TBBs determine how those individual CPs are added to a page when the page is generated or published.
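To make that flow a bit more concrete, here's a rough sketch in Python. None of this is Tridion's actual API: the function names, the dictionary-based components, and the markup are all made up purely to show the component-to-CP-to-page hand-off.

```python
# Hypothetical sketch of the component -> CT -> CP -> PT -> page flow.
# None of these names come from Tridion; they are illustrative only.
from xml.sax.saxutils import escape

def component_template(component):
    """Turns a component (plain field values) into a component presentation (HTML)."""
    return '<div class="article"><h2>%s</h2><p>%s</p></div>' % (
        escape(component['title']), escape(component['summary']))

def page_template(title, presentations):
    """Assembles already-rendered component presentations into a page."""
    body = '\n'.join(presentations)
    return '<html><head><title>%s</title></head><body>\n%s\n</body></html>' % (
        escape(title), body)

# Components carry values only; markup is added by the templates.
components = [
    {'title': 'First item', 'summary': 'Fish & chips'},
    {'title': 'Second item', 'summary': 'Plain text only'},
]
presentations = [component_template(c) for c in components]
page = page_template('Example page', presentations)
```

The point of the sketch is only that the component knows nothing about markup, and the component presentation knows nothing about the page it ends up on.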

An alternate approach, and one that seems to be fairly common with other CMSs, is to have a page layout that is roughly equivalent to a Tridion PT, where regions on the page can contain any (or sometimes any of a limited set) of the available content-items. There may be other approaches as well, and most of the "other" CMSs seem to support custom content-types, though few of them seem to support those types out of the box without some set-up, or at least some customization of the page templates.

Since the focus for Tridion is on building pages that contain CPs, there's a context implied (if not outright required) by the collaboration of Page and Component templating, and that allows virtually complete freedom in the generation of CT and PT structures and interaction. It also allows for a fair amount of logic to be put in place at the templating level to ensure that the implied context isn't violated. The "other" CMSs can provide a similar context for custom content-types, approaching it from the direction of ensuring that every content-item, regardless of where it might reside on a page, is contextually self-contained.

If you're following (and care about this subject at all), you might be wondering why it matters. A fair question.

It matters because the expectation (at least for some of my co-workers) is more in line with the non-Tridion approach than with the Tridion approach. Their expectation seems to be that when a chunk of content is retrieved from the CMS (a process that is happening outside the CMS itself, contained in application pages that are essentially going to pull content from the CMS as needed), the context for that content will be there. But really, a Component or CP that is, for example, populating a <li> element in the page has no way of "knowing" that it will be presented inside an <ol> or <ul> element. Since business users will be managing the content and pages (that's why you get a CMS, after all!), this will inevitably lead to bugs, and bugs that may very well go unnoticed for weeks or months. It won't even take anything significant - a simple miscommunication between business users/stakeholders and developers would be enough.

Granted, at some level, the same can be said for the same CP as it's being assembled onto a page using a PT, but at least in that case, the context can be supplied or enforced (and would be, if I had my way) at the PT level. Outside that CMS environment, the application has to supply that context. If the context needs to change, for whatever reason, that requires a change to the application while an identical change managed solely in the CMS would require a less-expensive templating change.
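A small Python sketch of what "supplying or enforcing the context" could look like at the page-template level. The function names and the crude string check are assumptions made for illustration; nothing here reflects how Tridion templates are actually written.

```python
# Hypothetical sketch: the page-level template supplies the list context
# that an individual <li> component presentation cannot know about.

def render_list_item(text):
    """Component-level rendering: a bare <li>, with no knowledge of its parent."""
    return '<li>%s</li>' % text

def render_list_region(item_presentations, ordered=False):
    """Page-level rendering: wraps <li> presentations in the list element they need."""
    for cp in item_presentations:
        # Enforce the implied context: only <li> presentations belong here.
        if not (cp.startswith('<li>') and cp.endswith('</li>')):
            raise ValueError('Expected an <li> presentation, got: %r' % cp)
    tag = 'ol' if ordered else 'ul'
    return '<%s>%s</%s>' % (tag, ''.join(item_presentations), tag)

items = [render_list_item(t) for t in ('Apples', 'Pears')]
region = render_list_region(items)
```

If the items are instead pulled into an application page one at a time, the equivalent of render_list_region has to live in the application, and changing from a <ul> to an <ol> becomes an application change rather than a template change.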

Not that either approach is better or worse, per se... but given that we've spent the time and money to get Tridion operational (probably close to a half million dollars, if I remember the base costs correctly, and assuming that all the staff working with it are being paid even close to what I'd expect based on industry standards), working around its templating this way seems like a terrible waste of time and money. After all, we could've downloaded something like Drupal or WordPress and converted it entirely to the company-preferred platform (.NET) in less time, and probably with a lot less effort.

As to why this matters...? Well, there's been some discussion of ensuring "semantic" markup (whatever exactly that means in the context it's being used), with the implication that its "semanticness" needs to be independent of the use that the content is being put to. So we want to have contextually-relevant output without a context, which means that we have to infer a context, or mandate one.

That feels frighteningly fragile to me...

Friday, September 9, 2011

The BaseFilesystemItemCache abstract class

See My Coding Style for explanation of, well, my coding style...

BaseFilesystemItemCache is a nominal abstract class, providing basic caching capabilities for the DBFSpy stack. The intent is to provide an in-memory cache of the filesystem, so that file-retrieval doesn't have to wait on the back-end database, while allowing the individual file-objects themselves (IFilesystemItem-derived instances) to handle keeping track of their cache-age.
Cache (Property):
The object's IFilesystemItem-object cache
Guids (Property):
A list of all the Guid properties of the items in the cache.
Paths (Property):
A list of all of the Path properties of the items in the cache (calculated, so less efficient than using the Guids)
Add (Method):
Adds an IFilesystemItem to the cache.
_BuildCache (Method):
Builds the object's Cache from a supplied list of items.
GetItem (Method):
Returns a specific item from the cache, specified by either Guid or Path
GuidOfPath (Method):
Returns the Guid corresponding to the specified path.
PathOfGuid (Method):
Returns the Path corresponding to the specified Guid
Remove (Abstract Method):
Removes an item from the cache

# Standard-library modules used by the class below
import time
import types

class BaseFilesystemItemCache( object ):
    """Nominal abstract class, provides functional requirements for objects that can cache virtual filesystem items."""

    ###########################
    # Class Attributes        #
    ###########################

    ###########################
    # Class Property Getters  #
    ###########################

    def _GetCache( self ):
        """Gets the IFilesystemItem cache."""
        return self._cache

    def _GetCacheLifetime( self ):
        """Gets or sets the duration that items in the cache will be cached for."""
        return self._cacheDuration

    def _GetGuids( self ):
        """Gets a list of viable cached item-GUIDs."""
        return self._cache.keys()

    def _GetPaths( self ):
        """Gets a list of viable cached item-paths."""
        result = []
        for guid in self._cache.keys():
            result.append( self._cache[ guid ].Path )
        result.sort()
        return result

    ###########################
    # Class Property Setters  #
    ###########################

    def _SetCacheDuration( self, value ):
        if type( value ) not in [ types.IntType, types.LongType, types.FloatType ]:
            raise TypeError( '%s.CacheDuration error: Expected a numeric (float, int or long) value. %s is not valid.' % ( self.__class__.__name__, value ) )
        self._cacheDuration = value

    ###########################
    # Class Property Deleters #
    ###########################

    ###########################
    # Class Properties        #
    ###########################

    Cache = property( _GetCache, None, None, _GetCache.__doc__ )
    CacheDuration = property( _GetCacheLifetime, _SetCacheDuration, None, _GetCacheLifetime.__doc__ )
    Guids = property( _GetGuids, None, None, _GetGuids.__doc__ )
    Paths = property( _GetPaths, None, None, _GetPaths.__doc__ )

    ###########################
    # Object Constructor      #
    ###########################

    def __init__( self ):
        """Object constructor."""
        self._cache = {}

    ###########################
    # Object Destructor       #
    ###########################

    ###########################
    # Class Methods           #
    ###########################

    def Add( self, item ):
        """Adds an item to the cache."""
        if not isinstance( item, IFilesystemItem ):
            raise TypeError( '%s.Add error: Expected an instance implementing IFilesystemItem, %s does not' % ( self.__class__.__name__, item ) )
        self._cache[ item.Guid ] = item

    def _BuildCache( self, cacheItems ):
        """Builds the cache from a supplied list of IFilesystemItem instances."""
        if not type( cacheItems ) in [ types.ListType, types.TupleType ]:
            raise TypeError( '%s.BuildCache error: Expected a list or tuple of IFilesystemItem instances.' % ( self.__class__.__name__ ) )
        self._cache = {}
        try:
            for item in cacheItems:
                self.Add( item )
        except TypeError as error:
            raise TypeError( '%s.BuildCache error: Expected a list or tuple of IFilesystemItem instances. %s' % ( self.__class__.__name__, error ) )
        except:
            raise

    def GetItem( self, guidOrPath ):
        """Returns an item from the cache (attempting to get it into the cache if necessary first) using the GUID or Path of the item to identify it."""
        item = None
        # Check to see if the guidOrPath is a GUID formatted string:
        if len( guidOrPath ) == 36 and GUIDCHECKER.sub( '', guidOrPath ) == '':
            # It's a GUID
            item = self._cache.get( guidOrPath )
        else:
            # Assume it's a path
            guid = self.GuidOfPath( guidOrPath )
            if guid is not None:
                item = self._cache[ guid ]
        # Check to see if the cached item is stale and needs to be refreshed:
        if item is not None and time.time() - item.Atime > self._cacheDuration:
            item.Refresh()
        return item

    def GuidOfPath( self, path ):
        """Returns the GUID of the cache-item at a specified path, or None if no GUID is available."""
        for guid in self._cache.keys():
            if self._cache[ guid ].Path == path:
                return guid
        return None

    def PathOfGuid( self, guid ):
        """Returns the Path of the cache-item identified by the supplied GUID, or None if no Path is available."""
        if guid in self.Guids:
            return self._cache[ guid ].Path
        return None

    def Remove( self, item ):
        """Removes an item from the cache."""
        if item.Guid in self.Guids:
            del self._cache[ item.Guid ]

    ###########################
    # Static Class Methods    #
    ###########################

    pass
__all__ += [ 'BaseFilesystemItemCache' ]

Commentary

The existence of the BaseFilesystemItemCache abstract class is predicated on the desire to have the ability to cache filesystem items. Whether that proves to be useful or not will have to wait until the final executable is complete and can be run, but my expectation (at this time, at any rate) is that it will be beneficial, particularly in high-load use-cases. At the same time, any time caching gets involved, there are risks that have to be managed: latency of the cache, making sure it's current, etc., ad infinitum. The intention in DBFSpy is to spread the load out, such that the presence or absence of a filesystem item is managed at the BaseFilesystemItemCache level, while the caching of the actual data for the item is managed by the items themselves (which will implement IFilesystemItem).
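That split of responsibilities can be sketched with stand-in classes. SketchItem and SketchCache below are hypothetical and much simpler than the real IFilesystemItem and BaseFilesystemItemCache; the only parts taken from the listing above are the Guid, Path and Atime members, the Refresh method, and the age check in GetItem.

```python
import time

class SketchItem(object):
    """Hypothetical stand-in for an IFilesystemItem implementation."""
    def __init__(self, guid, path):
        self.Guid = guid
        self.Path = path
        self.Atime = time.time()   # when the item's data was last loaded
        self.RefreshCount = 0      # illustration only: counts reloads

    def Refresh(self):
        """Pretends to reload the item's data from the back-end database."""
        self.RefreshCount += 1
        self.Atime = time.time()

class SketchCache(object):
    """Tracks which items exist; leaves the reloading of data to the items."""
    def __init__(self, cacheDuration):
        self._cache = {}
        self._cacheDuration = cacheDuration

    def Add(self, item):
        self._cache[item.Guid] = item

    def GetItem(self, guid):
        item = self._cache.get(guid)
        # The cache only decides *whether* the item is too old;
        # the item itself knows *how* to refresh its data.
        if item is not None and time.time() - item.Atime > self._cacheDuration:
            item.Refresh()
        return item

cache = SketchCache(cacheDuration=60)
fresh = SketchItem('guid-fresh', '/fresh.txt')
stale = SketchItem('guid-stale', '/stale.txt')
stale.Atime -= 120             # simulate an item loaded two minutes ago
cache.Add(fresh)
cache.Add(stale)
cache.GetItem('guid-fresh')    # young enough: returned as-is
cache.GetItem('guid-stale')    # older than 60 seconds: refreshed first
```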

Also, in an effort to ensure that the identifiers in the database are of a reasonable size, item IDs will be GUIDs - but at the same time, those GUIDs are not terribly useful outside of that database context, since the native filesystem relies on paths. The Paths property is intended to provide those paths on demand, but I'm not confident (yet) that the mechanism I've chosen (walking every cached item to read its Path) will perform as well as I want it to. The alternative, however, would potentially require more convoluted (and potentially fragile) code, to allow an item's path to also be a cache-key - complete with the ability for items to be altered or deleted from the cache by either the GUID or the path...
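For comparison, here's a minimal sketch of what that alternative might look like: a cache that keeps a second dictionary mapping paths to GUIDs, so lookups and removals work by either key. PathIndexedCache and Item are hypothetical names, not part of DBFSpy, and treating any string that isn't a known path as a GUID is a simplification of the GUIDCHECKER test used above.

```python
class Item(object):
    """Hypothetical minimal item with just the two identifying properties."""
    def __init__(self, guid, path):
        self.Guid = guid
        self.Path = path

class PathIndexedCache(object):
    """Hypothetical cache keyed by GUID, with a secondary path-to-GUID index."""
    def __init__(self):
        self._cache = {}        # GUID -> item
        self._guidByPath = {}   # path -> GUID
        self._pathByGuid = {}   # GUID -> path recorded at the last Add

    def _ResolveGuid(self, guidOrPath):
        # Anything that isn't a known path is treated as a GUID.
        return self._guidByPath.get(guidOrPath, guidOrPath)

    def Add(self, item):
        self.Remove(item.Guid)  # clear stale index entries if re-adding
        self._cache[item.Guid] = item
        self._guidByPath[item.Path] = item.Guid
        self._pathByGuid[item.Guid] = item.Path

    def GuidOfPath(self, path):
        return self._guidByPath.get(path)

    def GetItem(self, guidOrPath):
        return self._cache.get(self._ResolveGuid(guidOrPath))

    def Remove(self, guidOrPath):
        guid = self._ResolveGuid(guidOrPath)
        if guid in self._cache:
            del self._cache[guid]
            path = self._pathByGuid.pop(guid)
            # Only drop the path entry if it still points at this GUID.
            if self._guidByPath.get(path) == guid:
                del self._guidByPath[path]

cache = PathIndexedCache()
cache.Add(Item('guid-1', '/docs/readme.txt'))
cache.Add(Item('guid-2', '/docs/notes.txt'))
cache.Remove('/docs/notes.txt')   # removal by path
```

Path lookups become a single dictionary access instead of a walk over the whole cache, but the fragility is visible even at this size: three dictionaries have to stay in step, and an item whose Path changes after it was added leaves the index out of date until it is added again.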