Linux 4.0 and live patching

By Marcus Folkesson on 2 March 2015

Linus has released -rc1 of Linux 4.0, and it contains several interesting features.
One of the top headlines is live patching, which simply lets you patch a running system without losing any uptime. SUSE and Red Hat have each developed their own variant of a kernel patching mechanism (kGraft and kpatch, respectively), independently of, and unaware of, each other. The companies have agreed to base their work on a common API, which is what lands in v4.0. Then we will see what actually gets implemented in upcoming versions of the kernel.

An interesting feature indeed, but maybe it is just as Arjan van de Ven says on the mailing list:

Now, live patching sounds great as ideal, but it may end up being (mostly) similar to how hardware hotplug went: Everyone wants it, but nobody wants to use it.

MMAP memory between kernel- and userspace

By Marcus Folkesson on 21 January 2015

Allocating memory in the kernel and letting userspace map it sounds like an easy task, and sure it is. There are just a few things that are good to know about page mapping.
The MMU (Memory Management Unit) contains page tables with entries for mapping between virtual and physical addresses. These pages are the smallest unit that the MMU deals with.
The size of a page is given by the PAGE_SIZE macro in asm/page.h and is typically 4kB on most architectures.

There are a few more useful macros in asm/page.h:

PAGE_SHIFT: the number of bits to shift left by to get PAGE_SIZE.
PAGE_SIZE: the size of a page, defined as (1 << PAGE_SHIFT).
PAGE_ALIGN(len): rounds len up to the next multiple of PAGE_SIZE.
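As a quick illustration of how the three macros relate (a made-up helper, assuming the typical 4kB pages):

#include <linux/mm.h>	/* pulls in asm/page.h */

static void page_macro_demo(void)
{
	unsigned long len = 100;

	/* with 4kB pages: PAGE_SHIFT is 12, so PAGE_SIZE is 1 << 12 = 4096 */
	pr_info("PAGE_SIZE = %lu\n", PAGE_SIZE);

	/* rounds 100 up to the next page boundary, i.e. 4096 */
	pr_info("PAGE_ALIGN(%lu) = %lu\n", len, PAGE_ALIGN(len));
}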

How does mmap(2) work?

Every page table entry has a bit that tells us whether the entry is valid in supervisor mode (kernel mode) only, and sure enough, all memory allocated in kernel space will have this bit set.
What the mmap system call does is simply create a new page table entry, with a different virtual address, that points to the same physical memory page. The difference is that this supervisor bit is not set.
This lets userspace access the memory as if it were part of the application, because for now it is!
The kernel is not involved in those accesses at all, so it is really fast.
Magic? Kind of.
The magic is called remap_pfn_range().
What remap_pfn_range() essentially does is update the process-specific page tables with these new entries.
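For reference, its prototype from include/linux/mm.h looks like this:

/*
 * Map `size` bytes of physically contiguous memory, starting at page
 * frame number `pfn`, into the user VMA, starting at user virtual
 * address `addr`, with the protection bits `prot`. Returns 0 on success.
 */
int remap_pfn_range(struct vm_area_struct *vma, unsigned long addr,
		    unsigned long pfn, unsigned long size, pgprot_t prot);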

Example please

Allocate memory
As said before, the smallest unit that the MMU handles is PAGE_SIZE, and mmap(2) only works with full pages. Even if you want to share only 100 bytes, a whole page frame will be remapped and must therefore be allocated in the kernel.

The allocated memory must also be page aligned.

__get_free_pages()
One way to allocate pages is with __get_free_pages().

unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order)

The gfp_mask is commonly set to GFP_KERNEL in process/kernel context and GFP_ATOMIC in interrupt context. order specifies the number of pages to allocate as a power of two: you get 2^order contiguous pages. The get_order() helper converts a size in bytes to the smallest sufficient order.
For example:

u8 *vbuf = (u8 *)__get_free_pages(GFP_KERNEL, get_order(size));

Allocated memory is freed with free_pages(), passing back the same order.
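A minimal sketch of pairing the allocation with its release (the function names and static variables are made up for illustration):

#include <linux/gfp.h>
#include <linux/mm.h>

static unsigned long buf_addr;
static unsigned int buf_order;

static int buf_alloc(size_t size)
{
	/* get_order() converts a byte count to the smallest sufficient order */
	buf_order = get_order(size);

	buf_addr = __get_free_pages(GFP_KERNEL, buf_order);
	if (!buf_addr)
		return -ENOMEM;
	return 0;
}

static void buf_free(void)
{
	/* free with the same order that was used for the allocation */
	free_pages(buf_addr, buf_order);
}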

vmalloc()

A more common (and preferred) way to allocate virtually contiguous memory is with vmalloc().
vmalloc() will always allocate a whole set of pages, no matter what. This is exactly what we want!

Read about vmalloc() in the kmalloc(9) man page:

allocates size bytes, and returns a pointer to the allocated memory. size becomes page aligned by vmalloc(), so the smallest allocated amount is 4kB. The allocated pages are mapped to the virtual memory space behind the 1:1 mapped physical memory in the kernel space. Behind every vmalloc’ed area there is at least one unmapped page. So writing behind the end of a vmalloc’ed area will not result in a system crash, but in a segmentation violation in the kernel space. Because memory fragmentation isn’t a big problem for vmalloc(), vmalloc() should be used for huge amounts of memory.

Allocated memory is freed with vfree().

alloc_page()

If you need only one page, alloc_page() will give you that.
In that case, instead of using remap_pfn_range(), vm_insert_page() will do the work for you.
Notice that vm_insert_page() only works on order-0 (single-page) allocations, so if you want to map N pages, you will have to call vm_insert_page() N times, as in the sketch below.
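A minimal sketch of such a loop, assuming (purely for illustration) that the driver keeps its order-0 pages in an array priv->pages[] of length priv->n_pages:

static int pages_mmap(struct file *file, struct vm_area_struct *vma)
{
	struct mmap_priv *priv = file->private_data;
	unsigned long addr = vma->vm_start;
	int i, ret;

	if (vma->vm_end - vma->vm_start > priv->n_pages * PAGE_SIZE)
		return -EINVAL;

	for (i = 0; i < priv->n_pages; i++) {
		/* one call per order-0 page */
		ret = vm_insert_page(vma, addr, priv->pages[i]);
		if (ret)
			return ret;
		addr += PAGE_SIZE;
	}
	return 0;
}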

Now some code

Allocation

priv->a_size = ATTRIBUTE_N * ATTRIBUTE_SIZE;
/* round the size up to a whole number of pages */
priv->a_size = PAGE_ALIGN(priv->a_size);
priv->a_area = vmalloc(priv->a_size);
if (!priv->a_area)
	return -ENOMEM;

file_operations.mmap

static int scan_mmap(struct file *file, struct vm_area_struct *vma)
{
	struct mmap_priv *priv = file->private_data;
	unsigned long start = vma->vm_start;
	size_t size = vma->vm_end - vma->vm_start;
	void *vaddr = priv->a_area;
	unsigned long pfn;

	if (size > priv->a_size)
		return -EINVAL;

	/*
	 * vmalloc'ed memory is only virtually contiguous, so each page
	 * must be looked up and remapped individually.
	 */
	while (size > 0) {
		pfn = vmalloc_to_pfn(vaddr);
		if (remap_pfn_range(vma, start, pfn, PAGE_SIZE, PAGE_SHARED))
			return -EAGAIN;
		start += PAGE_SIZE;
		vaddr += PAGE_SIZE;
		size -= PAGE_SIZE;
	}

	/* keep the VMA pinned; VM_RESERVED was removed in Linux 3.7 */
	vma->vm_flags |= VM_DONTEXPAND | VM_DONTDUMP;
	return 0;
}
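For completeness, here is what the userspace side could look like; a sketch assuming the driver above is exposed through a (made-up) device node /dev/scan:

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	unsigned char *buf;
	int fd;

	fd = open("/dev/scan", O_RDWR);
	if (fd < 0)
		return 1;

	/* the length must be a whole number of pages, at most a_size */
	buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (buf == MAP_FAILED)
		return 1;

	/* reads and writes now go straight to the kernel buffer */
	printf("first byte: %u\n", buf[0]);

	munmap(buf, 4096);
	close(fd);
	return 0;
}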

Christmas party at Combitech AS

By Linn Eikeland on 22 December 2014

Merry Christmas and a happy New Year

This year's Christmas party in Stavanger was, true to tradition, an atmospheric pre-Christmas workshop full of children and grown-ups. A boat from Oljemuseet took us over to Natvigs Minde, where elves and gnomes were waiting for us in Santa's workshop. All of our 22 little elves eagerly got started making Christmas sweets, paper cones, glittering drawings and, not least, building birdhouses. With rice porridge, fruit squash, a Christmas story and Christmas music, the Christmas spirit truly settled over us.

The Oslo office's Christmas party was held in true Bayescamp spirit at Skjennungstua. Skjennungstua is a popular hiking destination in Nordmarka outside Oslo, so a brisk walk through the forest awaited before we could get there.

Fortunately, we Combitechers like to be prepared, so we sent Jan into the forest beforehand to check the conditions. The message was clear: the trail in was covered in ice, and everyone had to wear spikes (preferably the crampon kind).

On arrival at Skjennungstua the fireplace was lit, and we were served freshly baked buns and hot mulled wine. The buns alone made the trip worth the effort. For dinner we were served a lovely soup and turkey.

After a good dinner had been enjoyed, it was time to stroll back home. When we stepped out the door, we discovered to our delight that snow was falling heavily. The walk through a forest lit up by snow gave us a magical Christmas feeling.

With this, all of us at Combitech wish everyone a merry Christmas and a happy New Year!

TFS and feature branches

By Mats Sjövall on 11 December 2014

In my current project we use Git together with TFS and a large number of automated tests. Since TFS does not support gated check-ins together with Git, we have set up automatic builds of feature branches (we use the Gitflow Workflow branching strategy). For a developer it simply works like this: you create a new branch (e.g. feature/mycoolfeature) and publish it to the server, whereupon a build automatically kicks off. If the build is green, i.e. all tests pass, you can safely merge into the Dev branch without breaking it.

So how do you set this up?

Start by copying your existing build definition (to avoid mixing feature-branch builds with Dev-branch builds). Unfortunately, the clone function in TFS Power Tools does not work with builds that use Git, so you have to use Community TFS Build Manager VS2013 (unless you feel like clicking through all the settings manually).

Make sure the trigger is set to "Continuous Integration".

Go to the Source Settings tab and change the monitored branch to "refs/heads/feature/*". That is all it takes. When a new branch whose name starts with "feature/" is published, or when someone pushes to such a branch, a build will be triggered.

Unfortunately, the list of builds does not show which branch a build was made from, so it can sometimes be hard to find the right one without clicking into each build to see what it was built from (this can be solved by modifying the Build Process Template so that the build name includes the branch name, but that is a bit more involved).

Memory management in the kernel – part 1

By Marcus Folkesson on 28 November 2014

Memory management is among the most complex parts of the Linux kernel. There are so many critical parts, such as the page allocator, the slab allocator, virtual memory handling, memory mapping, the MMU, the IOMMU and so on. All these parts *have* to work perfectly (or at least almost perfectly :-) ) because every part of the system uses them, whether it wants to or not. If there is a bug or a performance issue, it will be noticed quite soon.

My goal is to produce a few posts on the topic and try to sort out the different parts, describe how they work and the connections between them. I will begin at the physical bottom and work my way up to how userspace allocates memory in its little blue world with pink clouds. (Everything is so easy on the user side.)

struct page

A page is the smallest unit that matters in terms of virtual memory, because the MMU (Memory Management Unit, described in an upcoming post) only deals with those pages. A typical page size is 4kB, at least on 32-bit architectures; most 64-bit architectures use 8kB pages.
Every one of those physical pages is represented by a struct page, defined in include/linux/mm_types.h.
That is a lot of pages. Let us do a simple calculation:
a 32-bit system with 512MB of physical memory divides that memory into 131,072 4kB pages. And 512MB is not even that much memory on a modern system today.

What I want to say is that struct page should be kept as small as possible, because it scales up a lot as physical memory increases.

Ok, so there is a struct page allocated for each physical page, which is a lot of structures, but what does it do?
It does a lot of housekeeping. Let us look at the set of members that I think are the most interesting:
struct page {
	unsigned long flags;
	unsigned long private;
	void *virtual;
	atomic_t _count;
	pgoff_t index;
#if USE_SPLIT_PTE_PTLOCKS
#if ALLOC_SPLIT_PTLOCKS
	spinlock_t *ptl;
#else
	spinlock_t ptl;
#endif
#endif
};

flags keeps track of the page status: dirty (needs to be written back to media), locked in memory (not allowed to be paged out), permission bits and so on. See enum pageflags in include/linux/page-flags.h for more information.

private is not a well-defined field; it may be used as a long or interpreted as a pointer. (It shares a union with ptl!)

virtual is the virtual address of the page. If the page belongs to high memory (memory that is not permanently mapped), this field will be NULL and the page requires dynamic mapping.

_count is a simple reference counter used to determine when the page is free for allocation.

index is the offset within a mapping.

ptl is an interesting one! I think it deserves a section of its own in this post. (It shares a union with private!)

Page Table Lock

PTL stands for Page Table Lock, and it is a per page-table lock. In the next part of this memory management series I will describe struct mm_struct and how PGD, PMD and PTE relate to each other, but for now it is enough that you have heard the words.

Ok, there is one thing that is good to know first. struct mm_struct (also defined in mm_types.h) represents a process's address space and contains all information related to the process memory. The structure has pointers to the virtual memory areas, which in turn refer to one or more struct page.

This structure also has the member mm->page_table_lock, a spinlock that protects all page tables of the mm_struct. This was the original approach and is still used by several architectures. However, mm->page_table_lock is a little bit clumsy, since it locks all page tables at once. That is no real problem on a single-CPU system without SMP, but nowadays that is not a very common scenario. Instead, the split page table lock was introduced: a separate per-table lock that allows concurrent access to page tables in the same mm_struct. Remember that the mm_struct is per process? So this improves page-fault/page-access performance in multi-threaded applications only.

When are split page table locks enabled?
They are enabled at compile time if CONFIG_SPLIT_PTLOCK_CPUS (I have never seen any value other than 4 for this one) is less than or equal to NR_CPUS.

Here are a few defines from the beginning of the mm_types.h header file:
#define USE_SPLIT_PTE_PTLOCKS (NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS)

#define USE_SPLIT_PMD_PTLOCKS (USE_SPLIT_PTE_PTLOCKS && \
IS_ENABLED(CONFIG_ARCH_ENABLE_SPLIT_PMD_PTLOCK))

#define ALLOC_SPLIT_PTLOCKS (SPINLOCK_SIZE > BITS_PER_LONG/8)

ALLOC_SPLIT_PTLOCKS is a little bit clever: if the size of a spinlock is less than or equal to the size of a long, the spinlock is embedded in struct page, which saves a cache line by avoiding an indirect access.

If the spinlock does not fit into a long, page->ptl is instead used as a pointer to a dynamically allocated spinlock. As I said, this is a clever construction, since it means that configurations which increase the size of a spinlock pose no problem. Examples of when the spinlock does not fit are when DEBUG_SPINLOCK or DEBUG_LOCK_ALLOC is enabled, or when the PREEMPT_RT patchset is applied.

The spinlock_t is allocated in pgtable_page_ctor() for PTE tables and in pgtable_pmd_page_ctor() for PMD tables. These functions (and the corresponding dtor functions) must be called in *every* place that allocates or frees a page table. This is already done in mainline, but I know there are evil hardware vendors out there that do not. For example, if you use their evil code and apply the PREEMPT_RT patchset (which increases the size of spinlock_t), you have to verify that their code behaves.

Also, pgtable_*page_ctor() can fail, and this must be handled properly.
Remember that page->ptl should *never* be accessed directly; use the appropriate helper functions for that.
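To give an idea of the pattern, here is a simplified sketch of how an architecture's pte_alloc_one() typically pairs the page allocation with the ctor and handles the failure case (modelled on the common pattern in arch code, not copied from any specific architecture):

pgtable_t pte_alloc_one(struct mm_struct *mm, unsigned long addr)
{
	struct page *pte;

	pte = alloc_page(GFP_KERNEL | __GFP_ZERO);
	if (!pte)
		return NULL;

	/* may fail: with ALLOC_SPLIT_PTLOCKS this allocates the spinlock */
	if (!pgtable_page_ctor(pte)) {
		__free_page(pte);
		return NULL;
	}
	return pte;
}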

Examples of such helper functions are:
pte_offset_map_lock()
pte_unmap_unlock()
pte_alloc_map_lock()
pte_lockptr()
pmd_lock()
pmd_lockptr()
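As a final illustration, a minimal sketch (my own, not taken from the kernel source) of touching a PTE with the proper lock held:

/*
 * Look up the PTE for `addr` with its page table lock held. `mm` and
 * `pmd` are assumed to be valid here; a real walk would first resolve
 * PGD -> (PUD) -> PMD.
 */
static void inspect_pte(struct mm_struct *mm, pmd_t *pmd, unsigned long addr)
{
	spinlock_t *ptl;
	pte_t *pte;

	/* maps the PTE and takes the (possibly split) lock for us */
	pte = pte_offset_map_lock(mm, pmd, addr, &ptl);

	/* ... it is now safe to read or modify *pte here ... */

	/* unmaps the PTE and releases the same lock */
	pte_unmap_unlock(pte, ptl);
}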