What's the fastest way to split dictionary keys into a string-type tuples and append another string to last items in the tuples?


Given a dictionary of string key and integer values, what's the fastest way to

  1. split each key into a string-type key tuple
  2. then append a special substring </w> to the last item in the tuple

Given:

counter = {'The': 6149,
     'Project': 205,
     'Gutenberg': 78,
     'EBook': 5,
     'of': 39169,
     'Adventures': 2,
     'Sherlock': 95,
     'Holmes': 198,
     'by': 6384,
     'Sir': 30,
     'Arthur': 18,
     'Conan': 3,
     'Doyle': 2,}

The goal is to achieve:

counter = {('T', 'h', 'e</w>'): 6149,
 ('P', 'r', 'o', 'j', 'e', 'c', 't</w>'): 205,
 ('G', 'u', 't', 'e', 'n', 'b', 'e', 'r', 'g</w>'): 78,
 ('E', 'B', 'o', 'o', 'k</w>'): 5,
 ('o', 'f</w>'): 39169,
 ('A', 'd', 'v', 'e', 'n', 't', 'u', 'r', 'e', 's</w>'): 2,
 ('S', 'h', 'e', 'r', 'l', 'o', 'c', 'k</w>'): 95,
 ('H', 'o', 'l', 'm', 'e', 's</w>'): 198,
 ('b', 'y</w>'): 6384,
 ('S', 'i', 'r</w>'): 30,
 ('A', 'r', 't', 'h', 'u', 'r</w>'): 18,
 ('C', 'o', 'n', 'a', 'n</w>'): 3,
 ('D', 'o', 'y', 'l', 'e</w>'): 2,}

One way to do it is to

I've tried

{(tuple(k[:-1])+(k[-1]+'</w>',) ,v) for k,v in counter.items()}

In more verbose form:

new_counter = {}
for k, v in counter.items():
    left = tuple(k[:-1])
    right = tuple(k[-1]+'w',)
    new_k = (left + right,)
    new_counter[new_k] = v

Is there a better way to do this?

Regarding the adding tuple and casting it to an outer tuple. Why is this allowed? Isn't tuple supposed to be immutable?


Answer

Or use str.split, and do str.join and '</w>' adding beforehand:

>>> counter = {'The': 6149,
     'Project': 205,
     'Gutenberg': 78,
     'EBook': 5,
     'of': 39169,
     'Adventures': 2,
     'Sherlock': 95,
     'Holmes': 198,
     'by': 6384,
     'Sir': 30,
     'Arthur': 18,
     'Conan': 3,
     'Doyle': 2,}
>>> {tuple((' '.join(k)+'</w>').split()):v for k,v in counter.items()}
{('T', 'h', 'e</w>'): 6149, ('P', 'r', 'o', 'j', 'e', 'c', 't</w>'): 205, ('G', 'u', 't', 'e', 'n', 'b', 'e', 'r', 'g</w>'): 78, ('E', 'B', 'o', 'o', 'k</w>'): 5, ('o', 'f</w>'): 39169, ('A', 'd', 'v', 'e', 'n', 't', 'u', 'r', 'e', 's</w>'): 2, ('S', 'h', 'e', 'r', 'l', 'o', 'c', 'k</w>'): 95, ('H', 'o', 'l', 'm', 'e', 's</w>'): 198, ('b', 'y</w>'): 6384, ('S', 'i', 'r</w>'): 30, ('A', 'r', 't', 'h', 'u', 'r</w>'): 18, ('C', 'o', 'n', 'a', 'n</w>'): 3, ('D', 'o', 'y', 'l', 'e</w>'): 2}
>>> 

Timings:

import timeit
print('bro-grammer:',timeit.timeit(lambda: [{(*a[:-1],f'a[-1]</w>',):b for a,b in counter.items()} for i in range(1000)],number=10))
print('Sandeep Kadapa:',timeit.timeit(lambda: [{tuple(key[:-1])+(key[-1]+'</w>',):value for key,value in counter.items()} for i in range(1000)],number=10))
print('U9-Forward:',timeit.timeit(lambda: [{tuple((' '.join(k)+'</w>').split()):v for k,v in counter.items()} for i in range(1000)],number=10))

Output:

bro-grammer: 0.1293355557653911
Sandeep Kadapa: 0.20885866344797197
U9-Forward: 0.3026948357193003