TAGS: CHECKLIST // INTEGRATION // SYSTEMS

Hard Lessons from Systems Integration Work

Not every technical failure needs a long post-mortem. A lot of them are basic oversights that repeat in every integration job. This is my personal checklist from independent systems integration work.

1) The ICD is law

I once tried to reverse-engineer a flight plan from post-flight hex logs without an ICD (Interface Control Document). That sounded doable… until I realized I was guessing offsets and data types with no way to confirm anything.

If there is no ICD, get the live stream. Capture real traffic while someone is actually clicking buttons and sending commands. Static logs alone can waste days.

This is because one wrong offset or field width shifts every field after it, and from that point on you are decoding garbage.
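
To make the shift concrete, here is a minimal sketch in Python. The record layout (message id, waypoint count, altitude) is invented for illustration and does not come from any real ICD. It decodes the same bytes once from the right offset and once starting one byte late.

```python
import struct

# Hypothetical record layout (an assumption, not from a real ICD):
# uint8 message id, uint16 waypoint count, float32 altitude, all big-endian.
RECORD = struct.Struct(">BHf")

payload = RECORD.pack(7, 3, 1500.0)

def decode(buf, offset):
    """Decode one record starting at the given byte offset."""
    return RECORD.unpack_from(buf, offset)

# Correct offset: the fields come back exactly as sent.
good = decode(payload, 0)

# Starting one byte late: every field is read from the wrong bytes.
# One filler byte is appended so the shifted read still has enough data.
bad = decode(payload + b"\x00", 1)

print("offset 0:", good)
print("offset 1:", bad)
```

Nothing in the second result looks obviously broken on its own, which is exactly why guessing offsets from static logs is so expensive.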

2) The struct padding trap

Sending a raw C++ struct over the network is one of those mistakes that works “on my machine” and fails everywhere else.

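Compilers insert padding bytes to align fields, and how many depends on the compiler, its settings, and the target architecture. The receiver does not know that. Result: fields appear shifted and values look random.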

If you really need a packed layout, use explicit packing rules, or serialize into a byte buffer in a controlled way. Do not rely on the compiler layout.
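
A rough way to see the difference without a C++ toolchain is Python's struct module, used here as a stand-in: its "@" mode follows the platform's native alignment, while "<" uses fixed sizes with no padding. The three-field message below is a made-up example.

```python
import struct

# Hypothetical message: uint8 status, uint32 counter, uint16 checksum.
NATIVE = struct.Struct("@BIH")  # native alignment: the platform may insert padding
WIRE = struct.Struct("<BIH")    # fixed sizes, no padding, little-endian

def encode(status, counter, checksum):
    """Serialize into an explicit wire layout instead of the native one."""
    return WIRE.pack(status, counter, checksum)

def decode(buf):
    """Parse the explicit wire layout back into its three fields."""
    return WIRE.unpack(buf)

print("native size:", NATIVE.size)  # platform dependent
print("wire size:  ", WIRE.size)    # 1 + 4 + 2 = 7 on every platform
```

If both ends agree on the explicit layout, the compiler's choices stop mattering.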

3) Endianness mismatches

Most modern PCs are little-endian, but not everything you connect to is. Multi-byte integers can arrive reversed if you skip conversion.

When I suspect this, I stop trusting decimals and look at raw hex. I compare what I sent vs what arrived, then apply the correct byte swap or conversion for the target system.

This is because big-endian CPUs store the most significant byte at the lowest address, while little-endian CPUs store the least significant byte there. Network byte order is big-endian.
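
The hex comparison can be sketched in a few lines of Python. The value 0x12345678 is arbitrary, chosen only because the byte order is easy to spot.

```python
import struct

value = 0x12345678

big = struct.pack(">I", value)     # big-endian (network byte order)
little = struct.pack("<I", value)  # little-endian

print(big.hex())     # 12345678
print(little.hex())  # 78563412

# Reading big-endian bytes as if they were little-endian reverses the value.
misread = struct.unpack("<I", big)[0]
print(hex(misread))  # 0x78563412
```

When a decoded number looks absurd but its hex is the mirror image of what was sent, it is almost certainly a byte order problem.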

4) Shared memory race conditions

Shared memory is fast, but it can bite you hard. If one process reads while the other is mid-write, you can end up with torn data. Half old, half new.

If the data matters, protect it. Use a named mutex or a strict pattern that guarantees a clean snapshot.

This is because writing a large structure is not atomic. It happens in several steps, and a reader can catch it between them.
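
One example of a "strict pattern" is a sequence counter (often called a seqlock): the writer makes the counter odd while it writes and even when it is done, and the reader rejects any copy taken while the counter was odd or changing. The sketch below assumes a single writer and an invented buffer layout, and it only shows the protocol. A real cross-process version also needs atomic counter access and memory barriers, which Python does not expose here.

```python
import struct

# Hypothetical layout: uint32 sequence counter, then three doubles (x, y, z).
HEADER = struct.Struct("<I")
PAYLOAD = struct.Struct("<ddd")

def new_buffer():
    """Stand-in for a shared memory region; counter starts at 0 (even)."""
    return bytearray(HEADER.size + PAYLOAD.size)

def write_snapshot(buf, x, y, z):
    """Single writer: counter goes odd, payload is written, counter goes even."""
    (seq,) = HEADER.unpack_from(buf, 0)
    HEADER.pack_into(buf, 0, (seq + 1) & 0xFFFFFFFF)  # odd: write in progress
    PAYLOAD.pack_into(buf, HEADER.size, x, y, z)
    HEADER.pack_into(buf, 0, (seq + 2) & 0xFFFFFFFF)  # even: write complete

def try_read_snapshot(buf):
    """Return (x, y, z), or None if a write overlapped the read."""
    (before,) = HEADER.unpack_from(buf, 0)
    if before % 2:                # writer is mid-update
        return None
    data = bytes(buf[HEADER.size:HEADER.size + PAYLOAD.size])
    (after,) = HEADER.unpack_from(buf, 0)
    if after != before:           # buffer changed while we were copying
        return None
    return PAYLOAD.unpack(data)
```

The reader simply retries when it gets None, so it never acts on a half-old, half-new structure.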

5) The monkey test

If you only test the happy path, users will finish the rest of the testing for you.

Double clicks, rapid spam, out-of-order actions, clicking start twice, stop before start. These inputs happen on site every time.

Try to break your own UI and state machine on purpose. If it survives your abuse, it will survive real users.
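
The abuse can be automated. Below is a minimal sketch of a monkey test against a toy start/stop state machine. The Controller class and its invariants are invented for illustration; the idea is to fire random actions and check after each one that nothing impossible happened.

```python
import random

class Controller:
    """Hypothetical start/stop state machine, used only for this illustration."""
    STATES = {"idle", "running"}

    def __init__(self):
        self.state = "idle"
        self.starts = 0

    def start(self):
        if self.state == "running":  # second click on Start: ignore it
            return
        self.state = "running"
        self.starts += 1

    def stop(self):
        if self.state == "idle":     # Stop before Start: ignore it
            return
        self.state = "idle"

def monkey_test(steps=10_000, seed=0):
    """Fire random actions and check the invariants after every one."""
    rng = random.Random(seed)        # fixed seed so a failure can be replayed
    ctl = Controller()
    actions = ["start", "start", "start", "stop"]  # biased toward Start spam
    for _ in range(steps):
        action = rng.choice(actions)
        state_before, starts_before = ctl.state, ctl.starts
        getattr(ctl, action)()
        assert ctl.state in Controller.STATES
        if state_before == "running" and action == "start":
            # A repeated Start must not begin a second session.
            assert ctl.starts == starts_before
    return ctl

monkey_test()
```

Keeping the seed fixed matters: when a random sequence breaks something, the same seed reproduces it.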

6) Silent firewall interference

UDP can work perfectly in your lab and die on site for a stupid reason: the machine switched to the “Public network” profile and the OS firewall started dropping traffic quietly.

First thing I check now is the network profile and firewall rules. I don’t wait until I’m deep in Wireshark and blaming the other system.

7) Incorrect network routing

I’ve seen packets go out the wrong interface because Windows decided Wi-Fi is “better” than the isolated sim Ethernet.

If you have multiple NICs, don’t leave routing to luck. Bind your socket to the exact interface IP you want.

Operating systems use route metrics, and the interface with a gateway often wins.
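
Binding to a specific interface address looks roughly like this in Python. The function name is made up, and loopback is used only so the sketch runs anywhere; on site the address would be the one assigned to the sim Ethernet NIC.

```python
import socket

def open_udp_on_interface(local_ip, local_port=0):
    """Create a UDP socket bound to one local interface address."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((local_ip, local_port))  # port 0 lets the OS pick a free port
    return sock

# Loopback stands in for the real NIC address here.
sock = open_udp_on_interface("127.0.0.1")
print(sock.getsockname())
sock.close()
```

One caveat, as far as I know: on some systems (Linux, for example) binding fixes the source address, but the routing table can still pick the outgoing interface, so it is worth checking the route metrics as well.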

8) Time synchronization

Clock drift is not a theory. It breaks real systems.

Some systems will reject data if timestamps look “from the past” or too far off. The packets are fine, the data is fine, but the time is wrong.

Sync the machines. Use NTP. One reference clock, everyone follows it.
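
A receiver-side check of this kind might look like the sketch below. The 2-second tolerance and the function name are assumptions for illustration; real systems define their own window.

```python
from datetime import datetime, timedelta, timezone

MAX_SKEW = timedelta(seconds=2)  # hypothetical tolerance

def timestamp_acceptable(sent_at, received_at, max_skew=MAX_SKEW):
    """True if the sender's timestamp is within max_skew of the receiver's clock."""
    return abs(received_at - sent_at) <= max_skew

now = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)

print(timestamp_acceptable(now - timedelta(seconds=1), now))   # True
print(timestamp_acceptable(now - timedelta(seconds=30), now))  # False
```

In the second call the packet is perfectly valid. It is rejected only because the sender's clock is 30 seconds behind.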

9) Descriptive bug reporting

“It doesn’t work” is not a bug report. It is a complaint.

I always ask for three things: what was the action, what happened, and what was expected. Without that, you can’t reproduce the problem, and without a reproduction you can’t confirm the fix.

10) Configuration versioning

A small config change can break the whole system, and then you forget what you touched.

Before changing anything, copy the config files and keep a recovery point. Do it every time, even if you feel confident.

This is because early “fixes” often introduce new variables and you lose the baseline.
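
The copy step is easy to script so it actually happens every time. A minimal sketch, assuming a backup folder called config_backups (the folder name and function name are arbitrary):

```python
import shutil
from datetime import datetime
from pathlib import Path

def backup_config(path, backup_dir="config_backups"):
    """Copy a config file into backup_dir with a timestamp in its name."""
    src = Path(path)
    dest_dir = Path(backup_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    dest = dest_dir / f"{src.stem}.{stamp}{src.suffix}"
    shutil.copy2(src, dest)  # copy2 also tries to preserve file metadata
    return dest
```

Calling backup_config("sim.ini") before an edit leaves something like config_backups/sim.20240101-120000.ini behind. Two backups within the same second would share a name, so this is a recovery point, not a full version history.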

11) Persistent documentation

I have solved the same issue twice, just because I didn’t write it down the first time.

Now I keep a technical log. Short notes are enough: symptoms, root cause, exact fix. Nothing fancy.

Technical knowledge is volatile. If you don’t record it, it disappears.

12) The physical layer

Hours can be wasted debugging software when the real issue is a cable, a loose connector, or the wrong port.

Check Layer 1 early. Power, link lights, cables, switch ports. It feels too simple, but it saves the most time.

Hardware disconnects can look exactly like software failure.