“Poor Man’s” CMS From The Ground Up — Part 1, Planning And Defines
--
In a previous article I talked about how going back to assembly language changed how I think about solving problems. I already look at things quite differently from the industry as a whole, but it was interesting to see how many allegedly “good practices” I’d adopted that may have just been time wasters.
So I’m starting over simple, building a “straight line” IPO (Input, Process, Output) “poor man’s” CMS that is little more than glue between content and template, providing some basic extras like “one index to rule them all”, auto minification and gzip of the output buffer, and results caching.
And as I’m always saying, the best place to start is to actually bother planning things and setting up your infrastructure before you lay out even a line of code.
As I said in that other article, I’m going to use modern PHP since node.js frustrations sent me running and screaming back into its gentle arms.
Basic Logic Flow
I laid this out in the previous article, but let’s review:
- Retrieve site configuration data
- Send HTTP headers common to all pages
- define() paths, files, and the user request extracted from REQUEST_URI
- see if request exists, if not 404
- see if cached result exists, if so send it
- if no cached result, build it.
For the first iteration we’ll skip the caching and gzip models as those are easy enough to implement where/as needed.
Directory Structure
Organizing files is one of the most important — and oft neglected — parts of building a new system. The way I like to handle things is to try and ensure that everything points “down-tree”. Aka I should not have to do any “../” in any file paths.
docs/directories.txt
/
Root, holds favicons, index.php, and that's it.
/cache
location of static copies -- raw and gzip -- of pages
/config
files containing configuration data. Should all end in .config.php
/docs
text files explaining the system... like this one.
/downloads
Where files the user can download are hosted
/errors
.config.php and .content.php files for building error pages.
/images
Content images
/includes
Files that contain static code to be executed directly. Should be ".inc.php"
/libs
Library files containing functions and objects, ".lib.php"
/pages
contains .config.php and .content.php files holding page content.
/templates
unique template directories
/templates/default
the default template.
/templates/*templateName*
location of .template.php and various CSS files for a specific template
/templates/*templateName*/fonts
webfonts specific to a template
/templates/*templateName*/images
images specific to a template such as backgrounds
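If you’d rather stamp that tree out with a script than by hand, a minimal sketch could look like the following. The helper name make_skeleton and the idea of scripting this at all are my assumptions for illustration; only the directory list itself comes from directories.txt above.

```php
<?php
// Hypothetical helper (not part of the original system): create the
// directory layout described in docs/directories.txt underneath $root.
function make_skeleton(string $root, string $templateName = 'default'): array {
    $root = rtrim($root, '/\\') . DIRECTORY_SEPARATOR;
    $dirs = [
        'cache', 'config', 'docs', 'downloads', 'errors',
        'images', 'includes', 'libs', 'pages',
        'templates/' . $templateName . '/fonts',
        'templates/' . $templateName . '/images',
    ];
    $created = [];
    foreach ($dirs as $dir) {
        $path = $root . $dir;
        // third argument true = also create any missing parent directories
        if (!is_dir($path) && mkdir($path, 0755, true)) {
            $created[] = $dir;
        }
    }
    return $created; // the directories that were actually created
}
```

Calling make_skeleton(__DIR__) from the site root keeps everything "down-tree": every path is built from the root downward, with no "../" anywhere.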
Naming Conventions
One thing I frequently hear programmers kvetch about is the “difficulty” of naming things. I have and continue to find that to be utter nonsense, but what do I know? I still say that most people saying that about CSS classes are wasting time naming things that their markup should already name for them.
There’s also the fact that planning not just what you’re doing, but what you might need later, can save you headaches in the long run. It’s why I want to applaud BEM for establishing a naming convention, despite my despising what it is actually used for. Aka taking a steaming dump on the markup, adding classes where there should be none.
Naming gets broken into two parts, code and filenames.
Functions, Objects, Variables And Defines
For variables I love to use prefixes for anything that goes into the global scope so I know what it’s from/for. For the defines, which I’m using a LOT, this is even more important since they are inherently global.
The format I try to use is:
where_whatItIs
Aka one word to describe where it’s from or what it’s for, an underscore, and then a camelcase (or all caps for define) description of what it is.
Take my defines, which will have the following prefixes:
definePrefixes.doc.txt
CACHE_
Caching files and locations
FILE_
relating to the local filesystem path
HTML_
Information used to build the markup, such as the character encoding,
site title, etc.
HTTP_
involves the HTTP request and related paths.
PAGE_
specific to the current page
SITE_
information about the site and its configuration
TEMPLATE_
The currently loaded site template
Common terms for the latter half of these variables help too. Such as “DIR” for directory, CONFIG for a .config.php file, “PAGE” for page content.
Filenames
I like to use a “multiple extension” format to quickly identify what the files do and are for. Thus:
fileExtensions.doc.txt
.config.php
A configuration file creating/loading data and setting DEFINE
GENERAL LOCATIONS:
/config, /errors, /pages, /pages/*subsection*,
/templates/*templateName*
.content.php
the information of a page to be plugged into the template as the
page content. PHP allowed.
GENERAL LOCATIONS:
/errors, /pages, /pages/*subsection*
.doc.txt
A text file containing documentation of this system.
GENERAL LOCATIONS:
/docs
.inc.php
A file that is blindly included. Typically contains no functions
or objects.
GENERAL LOCATIONS:
/errors, /includes
.lib.php
Library file, holds functions and/or classes.
GENERAL LOCATIONS:
/libs
.print.css
stylesheets for print media
GENERAL LOCATIONS:
/templates/*templateName*
.screen.css
stylesheets specific to screen media
GENERAL LOCATIONS:
/templates/*templateName*
.template.php
A template file. The "body.template.php" is a static file where
markup is allowed directly. Any other template files should wrap
their templates in functions.
GENERAL LOCATIONS:
/templates/*templateName*
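One nice side effect of the double extension is that code can read a file’s role straight off its name. Here is a small sketch of that idea; file_role is a hypothetical helper of my own, and the role table simply mirrors the list above.

```php
<?php
// Hypothetical helper: read a file's role off its double extension.
// Returns null for names that don't follow the convention.
function file_role(string $filename): ?string {
    // role => the final extension that role is expected to carry
    static $roles = [
        'config'   => 'php',
        'content'  => 'php',
        'inc'      => 'php',
        'lib'      => 'php',
        'template' => 'php',
        'doc'      => 'txt',
        'print'    => 'css',
        'screen'   => 'css',
    ];
    // need two trailing extensions, e.g. ".config.php"
    if (!preg_match('/\.([a-z]+)\.([a-z]+)$/', $filename, $m)) {
        return null;
    }
    [, $role, $ext] = $m;
    return (($roles[$role] ?? null) === $ext) ? $role : null;
}
```

So file_role('site.config.php') reports a config file, while a bare index.php (one extension only) falls outside the convention and comes back as null.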
Preplanning this type of stuff might feel like a waste of time, and certainly as production goes on it will need to be expanded or modified, but after some four and a half decades of writing software I can tell you the time spent here and now will save a metric ton of headaches later on. I cannot overstate how often projects sink before they even start because simple stuff like this isn’t planned, whilst everyone is sitting around circle-jerking over their “tech stack”, version control, pre-processing, and a host of other stuff that has little to do with actually getting work done.
And the Laugh Is
The above actually does most of the “hard thinking” for us. From here it’s just code that we’ve actually pre-planned the roadmap of!
So let’s start building the code.
One Index To Rule Them All
This simple concept has a host of different ways it can be implemented. The one I prefer is the whitelist approach, where you list off a bunch of file extensions the server can send normally. Any file not matching those criteria you send to the index.
For Apache the .htaccess or httpd.conf looks a little something like this:
RewriteEngine On
RewriteRule !(^(images|downloads))|(\.(gif|jpg|png|css|js|html|txt|ico|zip|rar|pdf|xml|mp4|mpg|flv|swf|mkv|ogg|avi|woff|woff2|svg|eot|ttf)$) index.php
Basically anything in /images or /downloads, or any file whose extension is in that list, behaves normally. Anything and everything else gets rerouted to the index.php. This includes all subdirectory requests.
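To sanity-check which requests that rule leaves alone, the same pattern can be exercised from PHP. This is only a simulation: served_directly is a hypothetical helper of mine, and it assumes the per-directory (.htaccess) case, where Apache matches against the path without its leading slash.

```php
<?php
// Hypothetical test helper: true when the whitelist pattern matches (Apache
// would serve the file itself), false when the request would be rewritten
// to index.php.
function served_directly(string $path): bool {
    $whitelist = '~(^(images|downloads))|(\.(gif|jpg|png|css|js|html|txt|ico|'
        . 'zip|rar|pdf|xml|mp4|mpg|flv|swf|mkv|ogg|avi|woff|woff2|svg|eot|ttf)$)~';
    return preg_match($whitelist, $path) === 1;
}

$samples = [
    'images/logo.png',
    'templates/default/default.screen.css',
    'docs-about',
    'pages/docs.content.php',
];
foreach ($samples as $path) {
    printf("%-40s %s\n", $path, served_directly($path) ? 'served as-is' : 'index.php');
}
```

Note that the first alternative is a plain prefix match, so anything whose path merely begins with "images" or "downloads" is let through, whatever its extension.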
The first file I make however is not the index.php. Instead I build the site configuration file.
<?php
/*
site.config.php
Poor Man's CMS
Jason M. Knight
4 Feb 2023
*/
define('HTML_NOCACHE', true); // debugging only, comment out to enable caching
define('HTML_NOMINIFY', true); // debugging only, comment out to enable minification
define('HTML_CHARSET', 'utf-8');
define('SITE_TITLE', "Poor Man's CMS Demo");
The first two HTML values are to disable our caching and minification for debugging purposes. Some pages might also need to disable these things: I’ll be using a very simple minify regex, so you might want to turn it off for pages with <pre> formatted code. The character set and site title also get set here. Simple basic information that people setting up this system or using it for their own projects can change without screwing with our index.php.
Then in the index.php, we can start out setting up a slew of defines.
<?php
/*
index.php
"Poor Man's" CMS
Jason M. Knight
04 Feb 2023
*/
include 'config/site.config.php';
header('Content-Type:text/html;charset=' . HTML_CHARSET);
header('X-Frame-Options:DENY'); // prevent clickjacking
define('FILE_ROOT', dirname(__FILE__) . DIRECTORY_SEPARATOR);
define('HTTP_ROOT', pathinfo($_SERVER['PHP_SELF'], PATHINFO_DIRNAME));
define('HTTP_BASE', pathinfo(
substr($_SERVER['REQUEST_URI'], strlen(HTTP_ROOT) ),
PATHINFO_FILENAME
) ?: 'index');
Load the config, send some http headers to set things like clickjack prevention and the character encoding. Remember that whatever is set in the HTTP headers for charset will override anything you declare in the markup.
FILE_ROOT is the directory of the index.php. I save this because if you try to do fopen, file_(get|put)_contents, or other such operations, a lot of servers do NOT make the current directory that of the executing file, but of the host directory. For example if our index.php was in:
/var/www/site001/web/testing
Don’t be shocked if, when you try to write to “cache/test.cache.raw”, it actually puts the file at “/var/www/site001/cache/test.cache.raw” instead of the “/var/www/site001/web/testing/cache/test.cache.raw” you actually want.
There are other ways to extract this on the fly, but it’s far simpler at startup to just dump it into a define.
HTTP_ROOT is important if you have relative pathing headaches, but also is what we need to extract off the start of our REQUEST_URI to make sure we have any pathed request correct, or to detect if it’s empty.
It’s messed up, but let’s say we’re running the CMS in testing/index.php, and we access “testing/”. Instead of an empty result in basename, we get “testing” as the basename. We don’t want that. Thus we strip HTTP_ROOT off the beginning of REQUEST_URI before we perform a basename extraction.
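The extraction is easier to poke at as a pure function. The sketch below mirrors the two defines; request_base is a hypothetical name, and dropping the query string first is my own addition (REQUEST_URI includes it, and the original code does not strip it).

```php
<?php
// Same logic as the HTTP_ROOT / HTTP_BASE defines, as a testable function.
function request_base(string $phpSelf, string $requestUri): string {
    // directory the CMS lives in, e.g. "/testing" for "/testing/index.php"
    $httpRoot = pathinfo($phpSelf, PATHINFO_DIRNAME);
    // Assumption (not in the original): ignore anything after "?"
    $path = explode('?', $requestUri, 2)[0];
    // strip the install directory off the front, then take the bare filename
    $relative = substr($path, strlen($httpRoot));
    return pathinfo($relative, PATHINFO_FILENAME) ?: 'index';
}

echo request_base('/testing/index.php', '/testing/'), "\n";           // index
echo request_base('/testing/index.php', '/testing/docs-about'), "\n"; // docs-about
```

Because PATHINFO_FILENAME drops the last extension, a request for docs.html resolves to the same base as docs.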
Next we can set up some PAGE_ related defines. There are three I think I’ll need/want.
PAGE_BASE: the base request filename, used to find the [pages|errors]/PAGE_BASE[.config.php|.content.php|.handler.php] files, as well as to create the cache/TEMPLATE_NAME_PAGE_BASE.[raw|gz] files.
PAGE_PATH: the path and filename prefix (PAGE_BASE) to grab those .config.php and .content.php files from.
PAGE_TYPE: is it a “content” page to be raw included off of a full path, or a “handler” in the root of /pages that processes the remainder of the URI.
A couple of simple functions help create this. I broke these out into functions because they are used more than once.
function define_error($name) {
define('PAGE_BASE', 'errors-' . $name);
define('PAGE_PATH', 'errors/' . $name);
define('PAGE_TYPE', 'content');
} // define_error
function define_request($base, $type) {
$prefix = 'pages/' . $base;
if (file_exists($prefix . '.' . $type . '.php')) {
// keep slashes out of PAGE_BASE so the cache filename stays flat
define('PAGE_BASE', str_replace('/', '-', $base));
define('PAGE_PATH', $prefix);
define('PAGE_TYPE', $type);
} else define_error('404');
} // define_request
Then we have to break up the request.
if (
// word characters plus - + ~ . only
preg_match('/[^\w\-\+\~\.]/', HTTP_BASE) ||
// as hyphens are our slashes for pathing,
// leading/trailing not allowed
(HTTP_BASE !== trim(HTTP_BASE, '-')) ||
// if using the handler delimiter, it's only allowed once
(count($split = explode("~", HTTP_BASE)) > 2)
) {
define_error('invalidURI');
} else if (count($split) > 1) {
define('PAGE_REQUEST', $split);
define_request($split[0], 'handler');
} else {
define_request(str_replace('-', '/', HTTP_BASE), 'content');
}
I added extra whitespace for clarity. For namespace safety I say that if our HTTP_BASE contains anything other than a-z, A-Z, 0-9, and the characters _ + - ~ . we reject the request as invalid (a 400 error). I build the $split array based on the presence of ~ and if there are more than two results, that too is invalid.
If however there are two results we define the request as $split[0] with type handler, and save the split to PAGE_REQUEST. Otherwise we assume that it’s a pathed request.
Guess I need to explain that. Parsing “pathed” requests like “reference/about” sucks for a host of reasons. As the URL would literally be a path you have the headaches of portability on your include paths in the markup. You end up playing games with <base>, using absolute URIs for everything, or all sorts of other trickery.
Instead of all that, why not just use letter replacement, like hyphens for pathing? Thus for example:
/docs == /pages/docs.content.php
/docs-about == /pages/docs/about.content.php
We’re able to organize files into subdirectories on the server whilst having none of the pathing headaches client-side. But what if we want to have a file that handles the request separately, such as in the future when/if we add a database? Just use one character different. The tilde is good as it’s an “unreserved” character few if any other systems use.
Thus:
/search~docs-httpHeaderFooter == /pages/search.handler.php
where "docs-httpHeaderFooter" is in PAGE_REQUEST
For static files the former format lets you organize into subdirectories. For non-static you can access the handler.
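The whole mapping fits in one side-effect-free function, which makes it easy to try URLs against. This is a sketch, not the index.php code: map_request is a hypothetical name, it skips the define() calls and the file_exists() check, and the extra test that rejects ".." is my own addition (dots are allowed by the character check, so I would rather not let them spell a parent directory).

```php
<?php
// Hypothetical pure-function version of the request mapping:
// "which file would this request base point at?" Returns null when the
// request would be answered with the invalidURI error page instead.
function map_request(string $base): ?array {
    if (
        preg_match('/[^\w\-+~.]/', $base) ||  // disallowed character
        $base !== trim($base, '-') ||          // leading/trailing hyphen
        strpos($base, '..') !== false          // my addition: no ".." segments
    ) {
        return null;
    }
    $split = explode('~', $base);
    if (count($split) > 2) {
        return null; // handler delimiter used more than once
    }
    if (count($split) === 2) {
        return [
            'type'    => 'handler',
            'file'    => 'pages/' . $split[0] . '.handler.php',
            'request' => $split[1],
        ];
    }
    return [
        'type'    => 'content',
        'file'    => 'pages/' . str_replace('-', '/', $base) . '.content.php',
        'request' => null,
    ];
}

print_r(map_request('docs-about'));
print_r(map_request('search~docs-httpHeaderFooter'));
```

A request containing a literal slash never gets this far as a valid page, since "/" is not in the allowed character set.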
Next we set up the template stuff.
/*
when user database is implemented, setup user data here
so you can figure out what template is being used.
*/
define('TEMPLATE_NAME', 'default');
define('TEMPLATE_DIR', 'templates/' . TEMPLATE_NAME . '/');
For now we just have the default template. You can change that here, but if you add user choice later on, well, as I commented, you would be expected to have loaded the user’s preferences by this point.
Next set up the rest of our basic defines.
define('PAGE_CONFIG', PAGE_PATH . '.config.php');
define('PAGE_CONTENT', PAGE_PATH . '.' . PAGE_TYPE . '.php');
define('CACHE_RAW', 'cache/' . TEMPLATE_NAME . '_' . PAGE_BASE . '.raw');
define('CACHE_GZ', 'cache/' . TEMPLATE_NAME . '_' . PAGE_BASE . '.gz');
What to include before the template (PAGE_CONFIG), inside the template (PAGE_CONTENT), and the names of the caching files.
Then we determine if the UA supports content encoding — aka gzip compression. We set the appropriate header for gzipped content if present, and set a define to match that.
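// not every request sends Accept-Encoding, so default to an empty string
$acceptEncoding = $_SERVER['HTTP_ACCEPT_ENCODING'] ?? '';
// 'x-gzip' goes first because 'gzip' is a substring of it;
// 'x-compress' is LZW rather than gzip, so it is not offered
foreach (['x-gzip', 'gzip'] as $type) {
if (strpos($acceptEncoding, $type) !== false) {
define('HTTP_GZIPTYPE', $type);
header('Content-Encoding:' . $type);
break;
}
}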
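The substring scan is fine for a first pass. If you ever want something stricter, a sketch that actually splits the header into tokens might look like this. pick_gzip_type is a hypothetical name, and honouring "q=0" (an encoding the client explicitly refuses) is my addition, not something the original code does.

```php
<?php
// Hypothetical helper: return the gzip token the client accepts, or null.
// Splits Accept-Encoding into its comma-separated tokens and skips any
// token carrying q=0. Both refinements are assumptions of this sketch.
function pick_gzip_type(string $acceptEncoding): ?string {
    foreach (explode(',', strtolower($acceptEncoding)) as $token) {
        $parts = explode(';', $token);
        $name  = trim($parts[0]);
        $q     = 1.0; // default weight when no q parameter is given
        foreach (array_slice($parts, 1) as $param) {
            if (preg_match('/^\s*q\s*=\s*([0-9.]+)\s*$/', $param, $m)) {
                $q = (float) $m[1];
            }
        }
        if ($q > 0 && ($name === 'gzip' || $name === 'x-gzip')) {
            return $name;
        }
    }
    return null;
}

// null when the header is absent or names no gzip variant
$type = pick_gzip_type($_SERVER['HTTP_ACCEPT_ENCODING'] ?? '');
```

Whether the extra strictness is worth the extra code is a judgment call; for most browsers the simple scan gives the same answer.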
For the caching, if CACHE_RAW exists, we can load the cache routines.
if (
!defined('HTML_NOCACHE') &&
file_exists(CACHE_RAW)
) include 'includes/checkCache.inc.php';
Or at least if caching isn’t disabled. CheckCache.inc.php will die() after outputting the cache if it exists and hasn’t expired. Otherwise it will return so we can…
include 'libs/outputBuffer.lib.php';
include 'includes/buildPage.inc.php';
Start up output buffering and build the page. Which is where index.php ends off.
The reason “buildPage” isn’t part of index.php, and the cache check isn’t either, is that I avoid loading code we might not run. If caching works there’s no reason for the page building and output buffering to even be loaded, and vice-versa!
Now, some might say I jump the gun writing the caching output routine before I even have output cached, but given we have the file naming convention in place, there is no reason to not go ahead and just do it.
<?php
/*
checkCache.inc.php
Jason M. Knight
4 Feb 2023
If caching is present and the cached copy is newer than source,
send the appropriate zipped or unzipped content and die.
If the caching is out of date, we return letting normal page-load
go through.
*/
$srcTime = filemtime(PAGE_CONTENT);
if (file_exists(PAGE_CONFIG)) {
$srcTime = max($srcTime, filemtime(PAGE_CONFIG));
}
if (filemtime(CACHE_RAW) > $srcTime) {
readfile(defined('HTTP_GZIPTYPE') ? CACHE_GZ : CACHE_RAW);
die();
}
Not exactly complex. We pull the time of the content and the optional config files; if they’re older than the cache file, send the appropriate one and die. Otherwise execution continues and the normal page is sent.
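The freshness test on its own can be sketched as a function that takes the cache file and a list of source files. cache_is_fresh is a hypothetical helper for illustration; the include above works on the defines directly.

```php
<?php
// Hypothetical helper: is the cached copy newer than every source file?
// Source files that don't exist (such as an optional .config.php) are skipped.
function cache_is_fresh(string $cacheFile, array $sourceFiles): bool {
    clearstatcache(); // filemtime() results are cached per request
    if (!is_file($cacheFile)) {
        return false;
    }
    $srcTime = 0;
    foreach ($sourceFiles as $file) {
        if (is_file($file)) {
            $srcTime = max($srcTime, filemtime($file));
        }
    }
    return filemtime($cacheFile) > $srcTime;
}
```

As the check stands, only the page’s own files are compared, so editing a template would not expire cached pages. Passing the template file in the source list would be one way to cover that, though that is my suggestion rather than anything the system does.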
Now the real magic is the output buffer… where I’m basically borrowing from graphics pipelines and triple-buffering.
<?php
/*
outputBuffer.lib.php
Jason M. Knight,
January 2023
Automagically handles gzip compression without server
config trickery. By using output buffering and a
shutdown function you can header() until blue in the
face anywhere in the code, so long as this is your
first include.
*/
ob_start(); // output buffer for compress or caching.
ob_implicit_flush(false); // no flushing on content output
I start by turning output buffering on. Notice that I do not use the gzip handler here because I want both raw and gzipped content to be saved for caching. I then have a static object for handling our own separate buffering:
final class Outputbuffer {
private static $buffer = '';
private static function add($markup, $noMinify = false) {
self::$buffer .= (
$noMinify || defined('HTML_NOMINIFY')
) ? $markup : self::minify($markup);
}
public static function minify($markup) {
return preg_replace(
[
'/[^\S ]+\</s', // non-space whitespace before tags
'/\>[^\S ]+/s', // non-space whitespace after tags
'/(\s)+/s', // long whitespace sequences
'/<!--(.|\s)*?-->/' // comments
], [
'<', '>', '\\1', ''
],
$markup
);
} // Outputbuffer::minify
public static function end($noMinify = false) {
self::add(ob_get_clean(), $noMinify);
return self::$buffer;
} // Outputbuffer::end
public static function push($noMinify = false) {
self::add(ob_get_contents(), $noMinify);
ob_clean();
} // Outputbuffer::push
} // Outputbuffer
self::$buffer is an internal string we dump PHP’s output buffer into for minification when the ::add routine is called. Typically you would do this via the “push” method. We also have our minification handler which I went ahead and made public since it might be useful elsewhere.
The difference between ::push and ::end is whether they empty or close the normal output buffer, with ::end also returning the contents of ::$buffer. That’s why the bits they share in common are handled by the private ::add method.
The minify script used is very basic, but reliable. It does not however trap things like <pre> tags, and thus that “push” method comes into play.
For example:
<?php OutputBuffer::push(); ?>
<pre>
This is some preformatted
indented text
</pre>
<?php OutputBuffer::push(true); ?>
The first ::push() will send everything already in the PHP buffer to the ::$buffer minified if minification is enabled, emptying PHP’s output buffer. The second one passing (true) will not minify the contents of the PHP buffer before adding it to ::$buffer, thus preserving our formatting.
We could write a more complex or tricky regex or add all sorts of logic to handle it automagically, but honestly this is just as simple an answer and gives you a degree of control over when and where minification occurs.
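To see why <pre> needs that bypass, it helps to run the patterns on a couple of strings. The sketch below is a standalone copy of the same four regexes; the function name minify_markup and the sample markup are mine, purely for experimentation.

```php
<?php
// Standalone copy of the minify patterns, for experimentation only.
function minify_markup(string $markup): string {
    return preg_replace(
        [
            '/[^\S ]+\</s',      // non-space whitespace before a tag
            '/\>[^\S ]+/s',      // non-space whitespace after a tag
            '/(\s)+/s',          // runs of whitespace
            '/<!--(.|\s)*?-->/'  // comments
        ],
        ['<', '>', '\\1', ''],
        $markup
    );
}

$page = "<div>\n\t<p>Hello   world</p>\n</div>\n<!-- note -->";
$pre  = "<pre>\nline one\n    indented line\n</pre>";

echo minify_markup($page), "\n";        // <div><p>Hello world</p></div>
var_dump(minify_markup($pre) === $pre); // bool(false): the layout was altered
```

Ordinary markup comes out smaller with nothing lost, while the preformatted block has its line breaks and indentation collapsed, which is exactly the case the push(true) call exists for.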
Some of you might recoil in horror at the static class… others will applaud it but ask why it’s not a “proper singleton”. The latter of you actually feeding why the former exists.
I like static objects, they are a great way to encapsulate multiple separate scopes in one file. I also find the more granular control just cleaner than “namespaces”. What I don’t get is what in the hell adding all that “getinstance” crap does apart from make it hard for people to do a “new” on it. Solution there? Don’t do a “new staticClass”, not throw more code at it for Christmas only knows what. All that extra garbage people throw at “singletons” and dicking around with “instances” are what gives them a bad name!
Next up — and the last thing I’m going to cover in this article — is the real magic of the outputBuffer, how it handles exits.
register_shutdown_function(function() {
$contents = Outputbuffer::end();
// gzencode() builds a complete gzip stream: header, data, and CRC32/size trailer
$compressed = gzencode($contents, 6);
if (
!defined('HTML_NOCACHE') &&
(error_get_last() == null)
) {
file_put_contents(FILE_ROOT . CACHE_RAW, $contents);
file_put_contents(FILE_ROOT . CACHE_GZ, $compressed);
}
echo defined('HTTP_GZIPTYPE') ? $compressed : $contents;
} );
Pull the output (minified or otherwise) from the secondary buffer, make a compressed copy. If caching isn’t disabled write the cache so it will be used next time, though naturally we don’t save if there was a PHP error. Either way output the appropriate version… and done.
The nice part is this is in a shutdown function, so if we start outputting gzipped content and an error occurs, or we want to randomly die() anywhere in our code, this function will still be run. Thus instead of “encoding errors” we still get useful output.
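If you want to convince yourself the compressed copy is something a client can actually unpack, a quick round-trip check is cheap. This sketch assumes the zlib extension is loaded and uses gzencode(), which is my choice for the illustration because it emits a complete gzip stream; build_payloads is a hypothetical name.

```php
<?php
// Sketch: build both cache payloads from one string and verify the gzip
// copy decodes back to the original. Assumes the zlib extension is loaded.
function build_payloads(string $contents, int $level = 6): array {
    return [
        'raw' => $contents,
        // full gzip stream: header, deflate data, CRC32 + size trailer
        'gz'  => gzencode($contents, $level),
    ];
}

$out = build_payloads(str_repeat('<p>Hello world</p>', 50));
printf("raw: %d bytes, gz: %d bytes\n", strlen($out['raw']), strlen($out['gz']));
var_dump(gzdecode($out['gz']) === $out['raw']); // bool(true)
```

The same two strings are what would land in the .raw and .gz cache files, so a check like this doubles as a test for the cache contents.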
Ok, I’m going to stop here so I can write the actual code that handles output. So far so good, maybe 20 minutes of coding and two hours of writing about the code.
We have our plan, we have our foundation “brickwork”, now it’s time to start erecting beams and studs.