Parsing BODS error logs

Sanitise Data Services error logs to make them actionable.

Purpose

Due to frequent errors with BODS, we need to process all available error logs and gather as much information as possible for root cause analysis.

Extract of an error log:

(14.2) 03-14-20 03:02:31 (R) (70620:78068) DBS-070401: |Data flow DF_ORDERS|Loader Query_SA_ORDERS
                                                       ODBC data source <DATABASE\INSTANCE4> error message for operation <SQLExecute>: <[Microsoft][SQL Server Native Client
                                                       11.0][SQL Server]Arithmetic overflow error converting numeric to data type numeric.
                                                       [Microsoft][SQL Server Native Client 11.0][SQL Server]The statement has been terminated.>.
(14.2) 03-14-20 03:02:31 (E) (70620:78068) DBS-070401: |Data flow DF_ORDERS|Loader Query_SA_ORDERS
                                                       ODBC data source <DATABASE\INSTANCE1> error message for operation <SQLExecute>: <*****>. This message contains some
                                                       internal system details which have been hidden for security. If you need to see the full contents of the original message, ask
                                                       your administrator to assign additional privileges to your account.

The actual message is wrapped over several lines, always indented with roughly 55 spaces (the exact number tends to fluctuate, for no apparent reason).

There are about 3,400 files to process, with a combined size of 36 GB. Individual file sizes range from 0 KB to over 1 GB.

Approach

To make the logs machine-readable, we first need to normalise the formatting and remove the whitespace indents in front of wrapped lines. To clean up and process the log files, we'll use regular expressions.

To find the line wraps, we'll use the following RE: \n\s{50,}. This detects a newline followed by 50 or more whitespace characters. We replace each match with a single space, so the lines become continuous and each line in the log is a complete error record.
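As a quick illustration (a minimal sketch, using a shortened, made-up fragment in the style of the log above), the substitution folds the wrapped continuation back onto the record it belongs to:

import re

patt_remove_wraps = re.compile(r'\n\s{50,}')

sample = (
    "(14.2) 03-14-20 03:02:31 (E) (70620:78068) DBS-070401: |Data flow DF_ORDERS|Loader Query_SA_ORDERS\n"
    + " " * 55
    + "ODBC data source <DATABASE\\INSTANCE1> error message for operation <SQLExecute>: <...>."
)

# The wrapped second line is joined back onto the first, separated by a single space
print(patt_remove_wraps.sub(' ', sample))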

To extract the information from each line, the RE below will be used (the named groups aren't strictly necessary, but they arguably make the RE easier to read).

^\((?:[\d\.]+)\)\s(?P<date_time>\d{2}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2})\s\((?P<msg_type>[WEX])\)\s\([\d:]+\)\s(?P<msg_code>\w{3}-\d+):\s(?P<message>.*)$
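As a minimal sketch (using a single flattened line with the message abbreviated), the named groups can be read straight off the match object:

import re

patt_record = re.compile(r'^\((?:[\d\.]+)\)\s(?P<date_time>\d{2}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2})\s\((?P<msg_type>[WEX])\)\s\([\d:]+\)\s(?P<msg_code>\w{3}-\d+):\s(?P<message>.*)$',
                         re.MULTILINE)

line = ('(14.2) 03-14-20 03:02:31 (E) (70620:78068) DBS-070401: '
        '|Data flow DF_ORDERS|Loader Query_SA_ORDERS ODBC data source <DATABASE\\INSTANCE1> ...')

match = patt_record.match(line)
print(match.groupdict())
# {'date_time': '03-14-20 03:02:31', 'msg_type': 'E',
#  'msg_code': 'DBS-070401', 'message': '|Data flow DF_ORDERS|...'}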

The flattened text (with line wraps removed) is temporarily stored in a StringIO buffer. After every 100 files (which of course doesn't guarantee an equal amount of data per batch), the buffered text is converted to a DataFrame using the RE, and the timestamp is converted to a native datetime datatype.

Converting each file to a DataFrame and appending it to a single DataFrame one file at a time, or first collecting all of the text and only then converting it, would both result in subpar performance and excessive resource usage; batching per 100 files is a middle ground.

To keep memory use down, the msg_type and msg_code columns are converted to a Categorical datatype, as they have a very low cardinality.
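As a rough, self-contained illustration of why this matters (a toy example, not taken from the actual logs):

import pandas as pd

# A low-cardinality string column stored as plain objects vs. as a categorical
s = pd.Series(['DBS-070401', 'RUN-050406'] * 500_000)

print(s.memory_usage(deep=True))                     # object dtype: tens of MB
print(s.astype('category').memory_usage(deep=True))  # category dtype: roughly 1 MB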

import re
import io
import logging
from pathlib import Path

import pandas as pd

# Remove wrapped lines
patt_remove_wraps = re.compile(r'\n\s{50,}')

patt_record = re.compile(r'^\((?:[\d\.]+)\)\s(?P<date_time>\d{2}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2})\s\((?P<msg_type>[WEX])\)\s\([\d:]+\)\s(?P<msg_code>\w{3}-\d+):\s(?P<message>.*)$',
                         re.MULTILINE)

# Columns extracted by the regex
column_list = ['timestamp', 'msg_type', 'msg_code', 'message']

def extract_records_from_text(raw_text):

    df = pd.DataFrame([record.groups() for record in patt_record.finditer(raw_text)],
                      columns=column_list)
    df['timestamp'] = pd.to_datetime(df.timestamp, format="%m-%d-%y %H:%M:%S")
    return df


flattened = io.StringIO()

base_folder = Path('/Logging/BODS')
df = pd.DataFrame([], columns=column_list)

for i, error_file in enumerate(base_folder.glob('**/error_*.txt'),1):

    if i % 100 == 0:
        # Every 100 files, convert the buffered text to
        # structured records and add them to the DataFrame
        df = pd.concat([df, extract_records_from_text(flattened.getvalue())],
                       ignore_index=True)
        # Close the io stream and open a new one
        flattened.close()
        flattened = io.StringIO()

        # Concatenating loses the category dtype, so reapply it
        df['msg_type'] = df['msg_type'].astype('category')
        df['msg_code'] = df['msg_code'].astype('category')

        print(i, end=' ', flush=True)

    with open(error_file) as f:
        raw_text = f.read()

    text = patt_remove_wraps.sub(' ', raw_text).strip()

    if len(text) > 0:
        flattened.write(text + '\n')

# Append the remaining buffered text
df = pd.concat([df, extract_records_from_text(flattened.getvalue())],
               ignore_index=True)
flattened.close()
df['msg_type'] = df['msg_type'].astype('category')
df['msg_code'] = df['msg_code'].astype('category')

logging.info('Done!')

The resulting DataFrame can either be persisted to disk (ca. 9.41 GB):

df.to_pickle('df_errors.pickle')

or the data can be transferred to a SQLite database (using SQLAlchemy, ca. 10 GB):

# Transfer the data from DataFrame to SQLite
from sqlalchemy import create_engine

engine = create_engine('sqlite:///errors.sqlite', echo=False)

# Pass the engine directly; pandas handles opening and closing connections
df.to_sql('errors',
          con=engine,
          index=False,
          if_exists='replace')

engine.dispose()
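The table can then be queried back selectively later on, for instance to pull only the actual errors (a sketch, assuming the errors table created above):

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('sqlite:///errors.sqlite', echo=False)

# Retrieve only the error records (msg_type 'E')
df_errors = pd.read_sql("SELECT * FROM errors WHERE msg_type = 'E'", con=engine)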

Keep in mind that the RE used to convert the error records purposely drops records with a msg_type of R, which is why the resulting data volume is smaller than that of the source files combined.
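A quick sanity check (a sketch, run against the resulting DataFrame) confirms that only the retained message types are present:

# Only W, E and X message types should remain after extraction
print(df['msg_type'].value_counts())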